This work addresses two related questions. The first question is what joint time-
shift-invariance, (2) positivity, (3) superposition, (4) locality, and (5) smoothness.
Several relations among these properties are proved: shift-invariance and positivity
imply the transform is a superposition of spectrograms; positivity and superposition
are equivalent conditions when the transform is real; positivity limits the
simultaneous time and frequency resolution (locality) possible for the transform, defining
ity and smoothness tradeoff by the 2-D generalization of the classical uncertainty
relation.  The  transform  that  best  meets  these  criteria  is  derived,  which  consists 
of  two-dimensionally  smoothed  Wigner  distributions  with  (possibly  oriented)  2-D 
gaussian  kernels.  These  transforms  are  then  related  to  time-frequency  filtering,  a 
method  for  estimating  the  time-varying  ‘transfer  function’  of  the  vocal  tract,  which 
is  somewhat  analogous  to  cepstral  filtering  generalized  to  the  time-varying  case. 
Natural  speech  examples  are  provided. 

The  second  question  addressed  is  how  to  obtain  a  rich,  symbolic  description  of  the 
phonetically  relevant  features  in  these  time-frequency  energy  surfaces,  the  so-called 
schematic spectrogram. Time-frequency ridges, the 2-D analog of spectral peaks,
are  one  feature  that  is  proposed.  If  non-oriented  kernels  are  used  for  the  energy 
representation,  then  the  ridge  tops  can  be  identified  with  zero-crossings  in  the  inner 
product  of  the  gradient  vector  and  the  direction  of  greatest  downward  curvature. 
If oriented kernels are used, the method can be generalized to give better
orientation selectivity (e.g., at intersecting ridges) at the cost of poorer time-frequency
locality. Many speech examples are given showing the performance for some tra-

Chapter 1.
Introduction 


In  order  to  perceive  speech  and  other  sounds,  the  incoming  sound  wave  must  be 
transformed  into  a  variety  of  representations,  each  bringing  forth  different  aspects 
of  the  signal,  its  source,  and  meaning.  Understanding  how  we  perceive  and  how 
machines  can  be  made  to  perceive  auditory  signals  means,  in  part,  discovering 
appropriate  representations  for  the  signals  and  how  to  compute  them.  For  many 
kinds  of  sounds,  little  is  known  in  this  respect.  What  auditory  features,  for  example, 
will  distinguish  a  knock  at  the  door  from  a  footstep? 

For  speech  signals,  more  is  thought  to  be  known.  A  phonetician  will  tell  you,  for 
example,  that  the  /ae/  in  bad  can  be  distinguished  from  the  /i/  in  bead  by  the 
location  of  characteristic  peaks  in  their  respective  spectra.  He  could  even  train  you 
to  identify  a  wide  variety  of  phonetic  elements  by  looking  at  their  spectrograms. 
Formalizing  this  knowledge,  however,  so  that  a  computer  can  do  this  well  (in  a 
general  setting)  has  proved  hard. 

An analogy may explain why. I could train you to distinguish a Mercedes from some
other car easily; I would just describe the hood ornament.† To train a machine
to do this task would be much harder. Not only would I have to describe the hood
ornament, but I would also have to provide all the visual abilities that I take for
granted with a human: finding edges and boundaries, recognizing closed forms,
etc. I believe the failure to correctly provide the corresponding auditory abilities
(finding spectral “peaks” and temporal discontinuities, recognizing continuous
forms, etc.) is an important reason why the speech recognition problem has been
so difficult.

† I thank Mark Liberman for this example.

This  problem  is  in  some  ways  even  harder  than  visual  analysis.  In  vision,  it  is  clear 
that  the  two-dimensional  image  is  a  natural  starting  point.  In  audition,  a  similar  2D 
representation  is  important,  with  time  along  one  axis  and  frequency  along  the  other. 
But  how  should  this  idea  be  made  precise  (the  well-known  uncertainty  principle  of 
fourier  analysis  is  one  of  the  thorny  issues  involved)?  Should  we  use  the  conventional 
spectrogram,  the  Wigner  distribution,  a  pseudo-auditory  spectrogram,  or  something 
entirely  new,  and  how  should  this  decision  be  made? 

In vision, edges, lines, and so forth are obviously important features
of an image. In audition, it is harder to decide what are the appropriate primitive
elements.  Can  some  symbolic  description  summarize  the  relevant  features  of  a 
sound’s  time-frequency  representation  analogous  to  how  a  line  drawing  summarizes 
an  image? 

These  questions  about  the  early  steps  in  auditory  processing  are  the  topic  of  this 
thesis.  The  emphasis  will  be  on  speech  signals  primarily  because  the  intermediate 
goals  to  which  the  initial  computations  must  aim  are  better  understood.  I  believe, 
nevertheless,  that  many  of  the  auditory  processing  issues  discussed  here  are  also 
relevant for other kinds of sounds.

The topic as stated is still too broad. Speech and other signals are made up of many
different kinds of components. For instance, speech has fairly smoothly changing
vocalic regions that are quite different from the more discontinuous structure of
consonantal regions. It is unlikely that the same initial representations will be
appropriate for every kind of signal. The emphasis here will be on signals like those
found in the more continuous, sonorant regions of speech.

In the sonorant regions, an apparent feature is the set of local spectral energy
concentrations that vary in center frequency with time. These peaks are due, in part,
to the “resonances” of the vocal tract, the so-called formants. The formant
locations (labelled F1, F2, ... in order of increasing frequency) specify the general vowel
quality, r-coloring and roundness, while the formant transitions between consonants
and vowels play an important role in consonant identification [see e.g. Chiba &
Kajiyama 1941; Fant 1960; Liberman, et al 1954; Ladefoged 1975]. A. Liberman, in
fact, claims that “...the second formant transition ... is probably the single most
important carrier of linguistic information in the speech signal” [Liberman, et al 1967].
Thus, restricting the discussion to these regions is by no means uninteresting.

The  initial  speech  processing  envisioned  here  has  been  divided  into  two  steps.  The 
first  step,  which  produces  a  joint  time-frequency  representation  of  the  signal  energy, 
is explored in Chapter 2 and Chapter 3. The second step, which produces a symbolic
representation  that  captures  the  acoustically  relevant  features  present  in  the  joint 
time-frequency  energy  representation,  is  explored  in  Chapter  4  (see  Figure  1.1). 

One of the most difficult problems in deriving the form of such representations is
deciding which properties or axioms to assume at the outset. If strong assumptions
are made about the received signal, then rigorously defined optimal detection can
result. For example, if we assume that the received signal consists solely of a
known signal in additive Gaussian noise, then we could build a matched filter that
performs optimal Bayesian detection [e.g., see Van Trees 1968]. The disadvantage
of such strong assumptions is that they are seldom universally valid for natural
perceptual signals.

Figure 1.1. The initial speech processing is seen as divided into two steps. (a) The
first step represents the signal energy as joint functions of time and frequency. (b)
The second step builds a symbolic representation of the significant features present
in the joint time-frequency energy representations. At this step, which we call the
schematic spectrogram, there is no undue commitment to the acoustic origin
of the features represented; it is a description of the signal, not its sources. (c)
In subsequent processing, these initial descriptions can be used to decompose the
signal into its acoustic sources.

On the other hand, weaker assumptions made about the received signal can be
combined with assumptions about the design of the representation, things like linearity,
continuity, locality, and stability, that can result in a solution [cf. Marr &
Nishihara]. These design criteria are chosen not on the basis of a specific signal model,
but  instead  as  reasonable  choices  that  should  be  appropriate  for  a  wide  range  of 
natural  signals.  The  disadvantage  of  this  approach  is  that  the  justification  of  the 
design  decisions  is  more  intuitive  and  abstract. 

In  the  best  of  circumstances,  the  two  approaches  would  result  in  the  same  or  similar 
solutions  to  a  problem.  Thus  the  auditory  processing  would  perform  optimally  (in 
different  senses)  when  both  appropriate  weak  and  strong  assumptions  are  made 
about  the  received  signal. 

Chapter  2  derives  those  joint  time-frequency  energy  representations  that  satisfy 
a  small  set  of  desirable  properties;  these  properties  are  intentionally  kept  quite 
general. Chapter 3 re-examines this problem in a more specific setting: given a
(time-varying) model of speech production, what time-frequency representation of
the signal best depicts the ‘transfer function’ of the vocal tract while suppressing
the excitation? These two approaches, in fact, yield similar solutions.

In the initial part of Chapter 4, a general, heuristic argument is used to produce a
phonetically relevant, symbolic representation of the signal. In a later part, these
solutions are briefly related to a signal detection model.

In Chapter 5, we look at a wide range of examples using these proposed methods.
We examine some traditionally difficult speech cases: glides and semi-vowels,
nasalized vowels, consonant-vowel transitions, female speech, and imperfect
transmission channels.

N.B.: For the figures in this thesis, time is in seconds, frequency in
Hertz,  and  energy  in  decibels,  unless  otherwise  indicated. 


Chapter  2. 

The  Time-Frequency 
Energy  Representation 

This  chapter  explores  the  design  of  joint  time-frequency  energy  representations  for 
speech signals. A set of desirable properties for such representations to satisfy is
proposed, and the relationships among these properties are discussed. This includes
a  general  treatment  of  the  ‘uncertainty’  relations  that  arise.  The  signal  transforms 
that  best  satisfy  these  properties  are  then  derived  and  examined. 

2.1.  The  stationary  case 

We  begin  with  an  analysis  of  the  special  case  of  stationary  signals.  There  is  a  large 
literature for this case; Rabiner & Schafer [1978] and Flanagan [1972] provide good
reviews.  The  discussion  of  it  here  is  very  condensed  and  confined  to  topics  that  are 
relevant  to  the  sequel. 

A stationary signal is used here to roughly mean a signal whose frequency content
does not vary with time. More precisely, we consider only deterministic signals that
are periodic and random signals that are correlation-stationary. For both kinds
of signals, the power spectrum, the fourier transform of the autocorrelation
function, captures naturally the energy present at each frequency.† Time is removed
from this representation; the power spectrum is a one-dimensional representation
of energy as a function of frequency.

For speech there are, of course, no completely stationary signals. We can,
however, deliberately utter vowels so that they are steady-state for as long as we
like. Figure 2.1 shows the spectrum of a long duration, voiced /i/. We find in the
spectrum many of the characteristic features of a steady-state vowel.

Let us examine the spectrum in Figure 2.1. Note the y-axis is logarithmic to
compress the wide dynamic range of the speech. At a fine scale in this spectrum, there
are peaks spaced about every hundred Hertz; these are the harmonics of the pitch.
The somewhat larger scale peaks, of a few hundred Hz bandwidth, are the formant
peaks. The peak at about 300 Hz is F1 and the peak at about 2300 Hz is F2, which
is characteristic of an /i/ vowel for an adult male. Still larger scale shaping of the
spectrum, so-called spectral balance, is due to the formant locations, the nature of
the voicing and the transmission channel.

The spectral structure of a vowel, therefore, is due acoustically to several factors:
(1) the vocal excitation (e.g., voiced); (2) the vocal tract transfer function,
characterized by its resonant frequencies, the formants; and (3) the transmission
characteristics (e.g., room acoustics). Determining these factors from the speech
(i.e., finding the formant frequencies, the pitch, etc.) is an important intermediate
step in speech analysis, since they decompose the signal into components of nearly
independent origin, and are (thus) starting points for the phonetician’s description
of the speech signal.

† For a deterministic signal x(t), its autocorrelation function is
$\int x(t+\tau)\,x^*(t)\,dt$, and for a stationary random process y(t), its
autocorrelation function is $E[y(t+\tau)\,y^*(t)]$.

Figure 2.1. Short-time log spectrum of a steady-state /i/. The finest scale
structure corresponds to the harmonics of the pitch, spaced about every 100 Hz. At an
intermediate scale are the formant peaks; e.g., F1 at 300 Hz and F2 at 2300 Hz. At
the largest scale is the overall spectral balance.


Figure 2.2. Spectrum in Figure 2.1 smoothed to suppress the excitation. (a)
Log spectrum convolved with gaussian (cepstral smoothing). (b) Power spectrum
convolved with gaussian (and then transformed to a log scale).

A key point in separating these factors in the speech signal is that they operate
at somewhat different scales in its spectrum; the fine scale structure is due mostly
to the excitation, while the intermediate scale structure is due to the vocal tract
transfer function. A common technique for selecting a scale of interest is to smooth
the spectrum by linear convolution, or equivalently, to window the fourier transform
of the spectrum. The fourier transform of the log spectrum is called the cepstrum, its
dimension quefrency, and the smoothing performed cepstral smoothing or liftering
[Oppenheim 1969; Oppenheim & Schafer 1975]. Figure 2.2a shows the spectrum in
Figure 2.1 after it has been cepstrally smoothed at a scale to emphasize the formants,
and suppress the excitation. We shall see in Chapter 3 that this operation, in fact,
effectively separates excitation from transfer function in certain idealized, stationary
cases.
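
The liftering operation described above can be sketched numerically. The fragment below is an illustration only, not the analysis behind Figure 2.2a: the synthetic vowel (a 100 Hz impulse train exciting two damped resonances near 300 and 2300 Hz), the 10 kHz sampling rate, and the names `cepstral_smooth`, `cutoff`, and `roughness` are all assumptions introduced here.

```python
import numpy as np

fs = 10000                      # sampling rate in Hz (assumed)
N = 2048                        # analysis length in samples (assumed)

# Synthetic steady-state vowel: a 100 Hz impulse train exciting two damped
# resonances that stand in for F1 (300 Hz) and F2 (2300 Hz).
excitation = np.zeros(N)
excitation[::100] = 1.0
t = np.arange(400) / fs
impulse_response = sum(np.exp(-np.pi * 80.0 * t) * np.sin(2 * np.pi * f * t)
                       for f in (300.0, 2300.0))
signal = np.convolve(excitation, impulse_response)[:N]

# Two-sided log power spectrum in dB.
power = np.abs(np.fft.fft(signal * np.hanning(N))) ** 2
log_spectrum = 10.0 * np.log10(power + 1e-12)


def cepstral_smooth(log_spec, cutoff):
    """Keep only cepstral coefficients with |quefrency| < cutoff samples."""
    cepstrum = np.fft.ifft(log_spec).real
    lifter = np.zeros(log_spec.size)
    lifter[:cutoff] = 1.0
    if cutoff > 1:
        lifter[-(cutoff - 1):] = 1.0    # mirror image, keeps the result real
    return np.fft.fft(cepstrum * lifter).real


def roughness(s):
    """Mean absolute bin-to-bin change, a crude measure of fine structure."""
    return np.mean(np.abs(np.diff(s)))


# Cut off at 5 ms, half of the 10 ms pitch period, to remove the harmonics.
smoothed = cepstral_smooth(log_spectrum, cutoff=50)
print(roughness(log_spectrum), roughness(smoothed))
```

The rectangular lifter is the simplest choice; a tapered lifter would reduce ringing in the smoothed spectrum at the cost of a less sharply defined scale.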

It  is  smoothing  the  power  spectrum,  not  its  logarithm,  that  most  easily  generalizes 
to  the  non-stationary  case  later.  We  will  therefore  select  our  scales  of  interest  by 
smoothing  the  power  spectrum  instead,  or  equivalently,  by  windowing  its  fourier 
transform,  the  autocorrelation  function.  Figure  2.2b  shows  the  spectrum  in  Figure 
2.1 after it has been thus smoothed.†
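
The equivalence between smoothing the power spectrum and windowing the autocorrelation function can be checked in discrete form. This is a minimal sketch under assumptions made here: a random stand-in signal, circular (DFT) convolution, and a gaussian lag window of 20 samples standard deviation; none of these values come from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.standard_normal(512)          # stand-in signal (assumed)
N = x.size

power = np.abs(np.fft.fft(x)) ** 2    # power spectrum (periodogram)
autocorr = np.fft.ifft(power).real    # circular autocorrelation

# Gaussian lag window, symmetric about lag 0 on the circular lag axis.
lags = np.fft.fftfreq(N, d=1.0 / N)   # 0, 1, ..., N/2 - 1, -N/2, ..., -1
lag_window = np.exp(-lags ** 2 / (2 * 20.0 ** 2))

# Route 1: window the autocorrelation, then transform.
smoothed_by_window = np.fft.fft(autocorr * lag_window).real

# Route 2: circularly convolve the power spectrum with the window's transform.
kernel = np.fft.fft(lag_window).real / N
shifts = np.arange(N)
smoothed_by_conv = np.array(
    [np.sum(power * kernel[(k - shifts) % N]) for k in range(N)])

print(np.max(np.abs(smoothed_by_window - smoothed_by_conv)))
```

The two routes agree to rounding error, and the first costs only two FFTs, which is why windowing the autocorrelation is the practical way to perform the smoothing.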

What should the form of the convolution kernel in this smoothing operation be?
A desirable smoothing kernel would have good locality (or resolution) for a given
amount of smoothing. In other words, it would have small duration for the given
duration of its transform. These two durations are related by the uncertainty
principle: given a function $h(x)$ with fourier transform $H(s)$, if the variance of
$|h(x)|^2$ is $(\Delta x)^2$ and the variance of $|H(s)|^2$ is $(\Delta s)^2$, then
$\Delta x\,\Delta s \ge \tfrac{1}{2}$, with $s$ in radians per unit of $x$ [Bracewell 1978]. Marr
& Hildreth [1980] proposed a gaussian smoothing kernel (in a vision task) because
it is the unique shape that meets the uncertainty principle with equality.

† Empirically, power and log smoothing often produce similar results.
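
The claim that the gaussian meets the uncertainty relation with equality can be checked numerically. The sketch below assumes radian frequency, so that the bound is 1/2, and compares a gaussian with a two-sided exponential; the function name `spreads`, the grid, and the comparison shape are choices made here for illustration.

```python
import numpy as np


def spreads(h, x):
    """Standard deviations of |h(x)|^2 and of |H(s)|^2, with s in radians."""
    dx = x[1] - x[0]
    p_x = np.abs(h) ** 2
    p_x = p_x / p_x.sum()
    mean_x = np.sum(x * p_x)
    delta_x = np.sqrt(np.sum((x - mean_x) ** 2 * p_x))

    s = 2 * np.pi * np.fft.fftfreq(x.size, d=dx)
    p_s = np.abs(np.fft.fft(h)) ** 2
    p_s = p_s / p_s.sum()
    mean_s = np.sum(s * p_s)
    delta_s = np.sqrt(np.sum((s - mean_s) ** 2 * p_s))
    return delta_x, delta_s


x = np.linspace(-20.0, 20.0, 4096, endpoint=False)
gaussian = np.exp(-x ** 2 / 2.0)
two_sided_exp = np.exp(-np.abs(x))

product_gaussian = np.prod(spreads(gaussian, x))
product_exp = np.prod(spreads(two_sided_exp, x))
print(product_gaussian, product_exp)
```

The gaussian product should come out very close to 1/2, while the two-sided exponential gives a noticeably larger value (analytically $1/\sqrt{2}$).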

2.2.  The  quasi-stationary  case 


The  previous  section  examined  the  analysis  of  stationary  speech  signals.  No  real 
speech  signal,  of  course,  is  purely  stationary.  If  the  frequency  content  of  a  signal 
varies  slowly  with  time,  however,  there  is  a  simple  extension  of  the  previous  results. 
The idea is to examine the signal over a short duration window. Given a signal x(t)
and a window g(t), the short-time power spectrum at time t is

$$S_x(t,\omega) = \left| \int_{-\infty}^{\infty} g(\tau)\, x(t+\tau)\, e^{-j\omega\tau}\, d\tau \right|^2 \tag{2.2.1}$$


Considered as a two-dimensional function of time and frequency, this signal
representation is called a spectrogram. Many different window shapes have been used;
they typically are symmetric, unimodal, and smooth, e.g., a gaussian or a raised
single period of a cosine.


Signals  for  which  a  window  can  be  found  whose  duration  is  long  enough  to  allow 
adequate  frequency  resolution,  but  short  enough  to  allow  adequate  time  resolution 
are  called  quasi-stationary.  The  example  of  the  previous  section  was,  in  fact,  a 
quasi-stationary  vowel.  Virtually  all  speech  analysis  methods  in  the  past  depend 
on  the  quasi-stationary  assumption. 
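
A discrete version of the short-time power spectrum with a gaussian window can be sketched as follows. This is an illustrative approximation rather than the implementation used for the figures: the sampling rate, the 4 msec window, the truncation at four standard deviations, the hop size, and the function name `spectrogram` are all assumptions made here.

```python
import numpy as np


def spectrogram(x, fs, sigma=0.004, hop=40, nfft=1024):
    """Short-time power spectrum |sum_tau g(tau) x(t + tau) e^{-j w tau}|^2."""
    half = int(round(4 * sigma * fs))           # truncate the window at +-4 sigma
    tau = np.arange(-half, half + 1) / fs
    g = np.exp(-tau ** 2 / (2 * sigma ** 2))
    centers = np.arange(half, x.size - half, hop)
    frames = np.stack([x[c - half:c + half + 1] * g for c in centers])
    # The FFT measures tau from the frame start rather than its center; this
    # changes only the phase, which the squared magnitude discards.
    S = np.abs(np.fft.rfft(frames, n=nfft, axis=1)) ** 2
    times = centers / fs
    freqs = np.fft.rfftfreq(nfft, d=1.0 / fs)
    return times, freqs, S


fs = 8000
t = np.arange(int(0.5 * fs)) / fs
tone = np.cos(2 * np.pi * 1000.0 * t)           # a stationary test signal
times, freqs, S = spectrogram(tone, fs)
peak_freqs = freqs[S.argmax(axis=1)]
print(peak_freqs.min(), peak_freqs.max())
```

For a stationary tone every time slice peaks at the tone frequency, which is the quasi-stationary ideal; the next section considers signals for which no window length achieves this.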

2.3.  Non-stationarity 

There do exist signals for which no window duration is adequate. A very simple
such signal is the linear chirp, $e^{jmt^2/2}$, whose instantaneous frequency increases
linearly with time. The quasi-stationary assumption breaks down for sufficiently large
modulation slope m of the signal. Let us examine this claim.

By the uncertainty principle, the product of the time duration $\Delta t$ and the frequency
duration (bandwidth) $\Delta\omega$ of a window is bounded below by 1/2. The window
duration and bandwidth, in turn, determine the time and frequency resolution,
respectively, in the short-time spectra.† In other words, if the window duration
is too small, then the frequency resolution will be poor and if the window duration
is too long, the time resolution will be poor. Further, for a non-stationary signal,
poor time resolution can also mean poor frequency resolution since the frequency
content will have changed over the duration of the window, blurring the spectrum.

To illustrate these points, consider the short-time spectrum of a linear chirp,
$e^{jmt^2/2}$, using a gaussian window, $e^{-t^2/2\sigma^2}$. We can measure the relative
bandwidth of the spectrum for different window sizes ($\sigma$'s) in terms of the standard
deviation of the spectrum (about 0.42 times the half-power bandwidth), which is
$\sqrt{(m^2\sigma^4 + 1)/2\sigma^2}$, where the units are seconds and radians. Note that when
$m \ne 0$, this grows without bound as the window size becomes very small or very large.
It has a minimum value of $\sqrt{m}$, which occurs when the standard deviation of the
gaussian is $\sigma = 1/\sqrt{m}$.

We see from this that the minimum possible bandwidth of the short-time spectrum
of a chirp (using a gaussian window) grows with increasing modulation slope.
Figure 2.3 shows the short-time spectra of chirps of various modulation slopes using
windows that give the minimum bandwidth. For a slope of 50 Hz/msec, the chirp
peak has been broadened by several hundred Hz in the spectrum. The point here
is that, in theory, the usual quasi-stationary spectral analysis methods will give
poor resolution for sufficiently non-stationary signals. A few examples from natural
speech will show that such conditions arise in practice.
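
The bandwidth of a gaussian-windowed chirp can be checked numerically. The sketch below assumes the chirp $e^{jmt^2/2}$ and the window $e^{-t^2/2\sigma^2}$, and compares the measured standard deviation of the short-time power spectrum with $\sqrt{(m^2\sigma^4+1)/2\sigma^2}$; the function names and the grid are choices made here for illustration.

```python
import numpy as np


def measured_bandwidth(m, sigma, n=1 << 14):
    """Std. dev. (rad/s) of the short-time power spectrum of exp(j*m*t^2/2)."""
    span = 8.0 * sigma                       # window is negligible beyond this
    t = np.linspace(-span, span, n, endpoint=False)
    frame = np.exp(-t ** 2 / (2 * sigma ** 2)) * np.exp(1j * m * t ** 2 / 2)
    omega = 2 * np.pi * np.fft.fftfreq(n, d=t[1] - t[0])
    power = np.abs(np.fft.fft(frame)) ** 2
    power = power / power.sum()
    mean = np.sum(omega * power)
    return np.sqrt(np.sum((omega - mean) ** 2 * power))


def predicted_bandwidth(m, sigma):
    """Closed-form standard deviation for the gaussian-windowed chirp."""
    return np.sqrt((m ** 2 * sigma ** 4 + 1) / (2 * sigma ** 2))


m = 2 * np.pi * 50e3                         # 50 Hz/msec expressed in rad/s^2
sigma_opt = 1.0 / np.sqrt(m)                 # window giving minimum bandwidth
for sigma in (sigma_opt / 3, sigma_opt, 3 * sigma_opt):
    print(sigma, measured_bandwidth(m, sigma), predicted_bandwidth(m, sigma))
```

Windows either shorter or longer than $1/\sqrt{m}$ give a wider spectrum, so for a steep chirp no choice of window length recovers a narrow peak.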


† This is made precise by Theorem D in Section 2.6.

Figure 2.4 shows cepstrally smoothed, short-time spectra of various /w/'s, uttered
first slowly and then increasingly rapidly. The spectrogram window used was a
gaussian of 4 msec standard deviation, which has an effective duration of about a
pitch period, the minimum duration that gives a reasonably stable spectral
estimate. The cepstral window is also chosen as brief as possible, while still removing
the harmonic peaks. Notice that the peak in the spectrum at about 1500 Hz,
corresponding to F2, grows in bandwidth with the increasing slope of F2 as seen in the
corresponding spectrograms in Figure 2.5. In case (c), where the F2 slope is about
40 Hz/msec, F2 is so broadened that its peak (i.e., the local maximum) is lost in
the short-time spectrum. Such an F2 slope is not uncommon for a /w/. In /j/'s, F2
can have large negative slopes, and in /r/ contexts, F3 can have very steep slopes;
see Figure 2.6. At consonant-vowel transitions, where the formant trajectories are
considered very important for stop consonant identification [Liberman, et al 1954],
the formant motion can also be very rapid; again see Figure 2.6.

It is worth noting that natural sounds other than the human voice can produce
non-stationary signals that are “chirped.” For instance, bird song and bat cries
contain many rapid FM chirps [Greenewalt 1968; Marler 1979; Neuweiler 1977]. If
a sound source is in relative motion to the listener then Doppler effects can cause
large frequency shifts in the received signal across time [e.g., Dudgeon 1984].†
Glissandi of various musical instruments provide still more examples of signals that
contain rapidly time-varying spectral content.

† Some bats (the so-called CF bats) emit continuous tones, evidently depending on Doppler shifts for
echolocation.

It is also suggestive that neurophysiologists have found that a large population of
the auditory cells in the mammalian cochlear nucleus do not respond optimally to
continuous tones, but instead to sweep tones, with different populations responding
to different preferred modulation slopes ranging over ±15 Hz/msec [Møller 1978;
Britt & Starr 1976]. Further, psychophysical adaptation studies have shown similar
directional selectivity in the human auditory system [Kay & Matthews 1972; Regan
& Tansley 1979].

Figure 2.4. Cepstrally smoothed, short-time spectra of /w/’s, uttered first very
slowly, then increasingly rapidly. In (c), F2 is so broadened by the analysis that its
peak (i.e., the local maximum) disappears. Cf. Figure 2.5.

Figure 2.5. Wide-band spectrograms of the /w/’s used in Figure 2.4. Note that
F2 remains clearly visible with increasing slope in the two-dimensional display.

The above comments are meant to call into question the validity of the
quasi-stationary assumption for speech and other auditory signals. We have seen that
speech is not always quasi-stationary, even in the sonorant regions. Assuming so
means that important features will be missed, having been blurred by the
analysis. It is interesting to note that while the individual short-time spectra of the
non-stationary signals described above give a poor description of the signals, their
spectrograms are nevertheless quite legible. This is because when we look at a
spectrogram, we are not confined to examining it one-dimensionally along single
frequency slices, but instead we see a two-dimensional time and frequency surface.
In other words, time is not used as a parameter that varies over a family of spectra,
but as one of the intrinsic dimensions of the representation.

I believe, in fact, that thinking of the initial speech processing as consisting of a
family of independent one-dimensional spectral analyses parameterized by time is
inappropriate. The problem should be thought of as a joint time-frequency analysis,
with the relationships and trade-offs between the two dimensions directly addressed,
which brings us to the next section.


2.4.  Joint  time-frequency  representations 


Various  ways  have  been  used  to  express  signal  energy  as  a  joint  function  of  time 
and frequency. Certainly the most popular is the spectrogram,

$$S_x(t,\omega) = \left| \int_{-\infty}^{\infty} g(\tau)\, x(t+\tau)\, e^{-j\omega\tau}\, d\tau \right|^2, \tag{2.4.1}$$

which is just the short-time spectra described above displayed two-dimensionally.
The  fact  that  the  simultaneous  time  and  frequency  resolution  in  the  spectrogram  is 
bounded  by  the  uncertainty  relation  has  led  others  to  seek  representations  that  do 
not  have  this  limitation. 


This is usually formulated in terms of the marginals (or projections) of the signal
representation [Cohen 1966]. Let

$$x_1(t) = \frac{1}{2\pi}\int_{-\infty}^{\infty} F_x(t,\omega)\,d\omega, \tag{2.4.2a}$$

$$x_2(\omega) = \int_{-\infty}^{\infty} F_x(t,\omega)\,dt. \tag{2.4.2b}$$

Perfect time and frequency resolution in this formulation requires that

$$x_1(t) = |x(t)|^2 \quad\text{and}\quad x_2(\omega) = |X(\omega)|^2. \tag{2.4.3}$$

An example of a joint time-frequency representation that satisfies these
requirements is the Wigner distribution,

$$W_x(t,\omega) = \int_{-\infty}^{\infty} e^{-j\omega\tau}\, x(t+\tau/2)\, x^*(t-\tau/2)\, d\tau, \tag{2.4.4}$$

which is currently quite popular in the signal processing literature [Claasen &
Mecklenbrauker 1980a,c].

The Wigner distribution of an impulse, $x(t) = \delta(t - t_0)$, is $W_x(t,\omega) = \delta(t - t_0)$, i.e.,
the signal energy is taken to lie on the vertical line $t = t_0$ in the time-frequency
plane. Similarly, for a complex exponential, $y(t) = e^{j\omega_0 t}$, the signal energy lies on
the horizontal line at $\omega = \omega_0$ ($W_y(t,\omega) = 2\pi\delta(\omega - \omega_0)$), and for a linear chirp,
$z(t) = e^{j(\omega_0 t + mt^2/2)}$, the energy lies on the slanted line $\omega = mt + \omega_0$
($W_z(t,\omega) = 2\pi\delta(\omega - \omega_0 - mt)$) (see Figure 2.7a).
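
The concentration of a chirp's energy along its instantaneous-frequency line can be seen in a discrete Wigner distribution. The following is a minimal sketch under assumptions made here: a complex (analytic) test chirp, half-lag sampling so that the frequency axis spans 0 to fs/2, a finite lag range, and an omitted constant scale factor. The function name `wigner` and all parameter values are hypothetical.

```python
import numpy as np


def wigner(x, fs, nfft=256):
    """Discrete Wigner distribution of a complex (analytic) signal x.

    Returns (freqs, W); W[n, k] is real and freqs[k] = k * fs / (2 * nfft) Hz.
    The constant factor from d(tau) is omitted.
    """
    half = nfft // 2
    W = np.zeros((x.size, nfft))
    for n in range(x.size):
        # Largest half-lag l for which both x[n + l] and x[n - l] exist.
        lmax = min(n, x.size - 1 - n, half - 1)
        l = np.arange(-lmax, lmax + 1)
        kernel = np.zeros(nfft, dtype=complex)
        kernel[l % nfft] = x[n + l] * np.conj(x[n - l])   # tau = 2 l / fs
        W[n] = np.fft.fft(kernel).real
    freqs = np.arange(nfft) * fs / (2.0 * nfft)
    return freqs, W


fs = 8000.0
t = np.arange(800) / fs
f0, slope = 500.0, 20e3                 # start frequency (Hz), slope (Hz/s)
chirp = np.exp(2j * np.pi * (f0 * t + 0.5 * slope * t ** 2))
freqs, W = wigner(chirp, fs)
ridge = freqs[W.argmax(axis=1)]         # frequency of the maximum at each time
instantaneous = f0 + slope * t
print(np.max(np.abs(ridge[200:600] - instantaneous[200:600])))
```

Away from the ends of the signal the ridge follows the instantaneous frequency to within one frequency bin, regardless of the chirp slope, in contrast to the broadened spectrogram peak.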

In contrast, the spectrograms of these signals consist of broadened lines (see Figure
2.7b). There is, in fact, a simple relation between the spectrogram and the Wigner
distribution of a signal x(t):

$$S_x(t,\omega) = \frac{1}{2\pi}\, W_x(t,\omega) ** W_g(t,\omega), \tag{2.4.5}$$

where ** denotes two-dimensional convolution and $W_g$ is the Wigner distribution of
the window [Claasen & Mecklenbrauker 1980c]. If g(t) is a gaussian, $g(t) = e^{-t^2/2\sigma^2}$,
then its Wigner distribution is also simple; it is just a two-dimensional gaussian,
$W_g(t,\omega) = 2\sqrt{\pi}\,\sigma\, e^{-(t^2/\sigma^2 + \sigma^2\omega^2)}$. Thus, the two-dimensional convolution of the Wigner
distributions in Figure 2.7a by a two-dimensional gaussian will give the spectrograms
in Figure 2.7b.

Figure 2.7. Wigner distribution and spectrogram of some mono-component
signals. (a) The Wigner distribution resolves these signals as perfectly narrow lines in
the time-frequency plane. (b) The spectrogram is a smoothed version of the Wigner
distribution (e.g., if the spectrogram window is a gaussian, then the smoothing
kernel is a 2-D gaussian). The lines are broadened in this representation.

If  the  duration  of  the  gaussian  spectrogram  window  is  decreased,  then  the  2-D 
gaussian that, in essence, convolves the Wigner distribution to give the spectrogram
becomes  narrower  in  time,  but  wider  in  frequency,  and  vice  versa.  It  should  be  clear 
from  this  example  that  the  spectrogram  does  not  meet  the  marginal  requirement. 

On the other hand, the Wigner distribution itself has some undesirable
properties. In particular, multi-component signals give rise to cross terms that cannot
be attributed much physical significance. For example, the Wigner distribution of
$x(t) = \cos\omega_0 t$ is
$W_x(t,\omega) = \frac{\pi}{2}[\delta(\omega - \omega_0) + \delta(\omega + \omega_0) + 2\delta(\omega)\cos 2\omega_0 t]$ (see Figure
2.8a). The last term, which lies on a horizontal line at the frequency origin (varying
sinusoidally in amplitude), seems spurious. The spectrogram of $\cos\omega_0 t$, however, is
just two broadened lines at $\omega = \pm\omega_0$, which seems better behaved with respect to
superposition, since $\cos\omega_0 t = \frac{1}{2}(e^{j\omega_0 t} + e^{-j\omega_0 t})$ (see Figure 2.8b). The cross term is,
in effect, smoothed out by the convolution that transforms the Wigner distribution
into the spectrogram.
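
The origin of the cross term for the cosine can be worked out directly. The derivation below assumes the Wigner distribution in the form $W_x(t,\omega) = \int e^{-j\omega\tau}\, x(t+\tau/2)\, x^*(t-\tau/2)\, d\tau$; it is a worked check, not a reproduction of the original derivation.

```latex
\begin{align*}
x(t+\tfrac{\tau}{2})\,x^*(t-\tfrac{\tau}{2})
  &= \cos\omega_0(t+\tfrac{\tau}{2})\,\cos\omega_0(t-\tfrac{\tau}{2})
   = \tfrac{1}{2}\left[\cos\omega_0\tau + \cos 2\omega_0 t\right], \\
W_x(t,\omega)
  &= \int_{-\infty}^{\infty} e^{-j\omega\tau}\,
     \tfrac{1}{2}\left[\cos\omega_0\tau + \cos 2\omega_0 t\right] d\tau \\
  &= \tfrac{\pi}{2}\left[\delta(\omega-\omega_0) + \delta(\omega+\omega_0)\right]
     + \pi\,\delta(\omega)\,\cos 2\omega_0 t .
\end{align*}
```

The term $\cos 2\omega_0 t$ does not depend on the lag $\tau$, so its transform collapses onto $\omega = 0$; it arises from the product of the two exponential components of the cosine, midway between them in frequency, which is why it carries no energy of its own.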

These examples illustrate that there are various (possibly conflicting) properties
that we might desire of a time-frequency representation, e.g., good time and
frequency resolution, and superposition for multi-component signals. We shall, in fact,
approach the problem of choosing our time-frequency energy representation by first
specifying a set of desirable properties that the transform should satisfy, and then
deriving its form.

2.5.  Design  criteria  for  joint  time-frequency  representations 

We will restrict the discussion to the quadratic transforms of the signal, which have
the form

$$F_x(t,\omega) = \int_{-\infty}^{\infty}\int_{-\infty}^{\infty} h(\tau_1,\tau_2;t,\omega)\,x(\tau_1)\,x^*(\tau_2)\,d\tau_1\,d\tau_2, \tag{2.5.1}$$

where $h(\tau_1,\tau_2;t,\omega)$ is an arbitrary function. This condition is imposed because it
results in a particularly manageable class, and because the representation of energy
as a quadratic function of the signal seems reasonable by analogy to other definitions
of energy. The class is quite large and includes many of the joint time-frequency
representations that have been previously proposed, such as the spectrograms, the
Wigner distribution, and the Rihaczek distribution [cf. Claasen & Mecklenbrauker
1980c].
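
As an illustration of Eq. 2.5.1, the sketch below writes one spectrogram value as a discrete quadratic form. It assumes the usual short-time Fourier form of the spectrogram (squared magnitude of a windowed Fourier sum), which may differ in convention from Eq. 2.4.1; the window width, analysis point, and variable names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 32
n = np.arange(N)
x = rng.standard_normal(N) + 1j * rng.standard_normal(N)

t0, omega = 12, 2 * np.pi * 5 / N           # hypothetical analysis point
g = np.exp(-0.5 * ((n - t0) / 3.0) ** 2)    # real gaussian window centred at t0

# Short-time Fourier coefficient and its squared magnitude (spectrogram value)
a = g * np.exp(-1j * omega * n)
stft_value = np.sum(a * x)
spec_value = np.abs(stft_value) ** 2

# The same number as a quadratic form: sum over n1, n2 of h[n1, n2] x[n1] conj(x[n2])
h = np.outer(a, np.conj(a))
quad_value = x @ h @ np.conj(x)
```

Here `h` plays the role of $h(\tau_1,\tau_2;t,\omega)$ at one fixed $(t,\omega)$; `quad_value` is real up to rounding and equals `spec_value`.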

From  this  class  of  representations,  we  seek  ones  that  satisfy  the  following  criteria: 

(C1) Shift invariance: A shift in time or frequency of the signal should result in
a corresponding shift in time or frequency in the transform. Let $y(t) = x(t-\tau)$ and
$z(t) = e^{i\phi t}x(t)$. Then we require $F_y(t,\omega) = F_x(t-\tau,\omega)$ and $F_z(t,\omega) = F_x(t,\omega-\phi)$.

This property is desirable if we want to interpret the two dimensions of the transform
as time and frequency.
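
A quick numerical check of property C1 for a spectrogram is sketched below. It assumes a circular (periodic) discrete spectrogram so that shifts are exact; the function `circular_spectrogram`, the window width, and the shift amounts `r` and `q` are illustrative assumptions.

```python
import numpy as np

def circular_spectrogram(x, g):
    """S[t, k] = |sum_n x[n] g[(n - t) mod N] exp(-2j*pi*k*n/N)|^2."""
    N = len(x)
    return np.array([np.abs(np.fft.fft(x * np.roll(g, t))) ** 2 for t in range(N)])

rng = np.random.default_rng(1)
N = 64
n = np.arange(N)
x = rng.standard_normal(N) + 1j * rng.standard_normal(N)

# Gaussian window centred on index 0, wrapped circularly
d = np.minimum(n, N - n)
g = np.exp(-0.5 * (d / 4.0) ** 2)

r, q = 5, 7                                  # hypothetical time / frequency shifts
y = np.roll(x, r)                            # y[n] = x[n - r]
z = x * np.exp(2j * np.pi * q * n / N)       # z[n] = exp(i*phi*n) x[n], phi = 2*pi*q/N

S_x = circular_spectrogram(x, g)
S_y = circular_spectrogram(y, g)
S_z = circular_spectrogram(z, g)
```

Under these assumptions `S_y` is `S_x` shifted by `r` along the time axis and `S_z` is `S_x` shifted by `q` along the frequency axis.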

Transforms satisfying this condition can be put in the forms

$$F_x(t,\omega) = \phi(t,\omega) \ast\ast\, W_x(t,\omega) \tag{2.5.2}$$

and

$$F_x(t,\omega) = \mathcal{F}^{-1}\left[\Phi(\tau,\nu)\,A_x(\tau,\nu)\right], \tag{2.5.3}$$

where "$\ast\ast$" denotes two-dimensional convolution, $W_x$ is the Wigner distribution

$$W_x(t,\omega) = \int_{-\infty}^{\infty} e^{-i\omega\tau}\,x(t+\tau/2)\,x^*(t-\tau/2)\,d\tau, \tag{2.5.4}$$

$\phi(t,\omega)$ is an arbitrary kernel function, $\mathcal{F}$ is the 2-D fourier transform in the form
$\mathcal{F}[q(t,\omega)] = \frac{1}{2\pi}\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} q(t,\omega)\,e^{-i(\nu t - \tau\omega)}\,dt\,d\omega$,
$\Phi(\tau,\nu) = \mathcal{F}[\phi(t,\omega)]$, and $A_x$ is the time-frequency autocorrelation function †

$$A_x(\tau,\nu) = \mathcal{F}[W_x(t,\omega)] = \int_{-\infty}^{\infty} e^{-i\nu t}\,x(t+\tau/2)\,x^*(t-\tau/2)\,dt \tag{2.5.5}$$

for $x(t)$ [Claasen & Mecklenbrauker 1980c]. Note that for a spectrogram, $\phi(t,\omega)$ is
the Wigner distribution of the spectrogram window, by Eq. 2.4.5 and Eq. 2.5.2.

(C2) Positivity: The signal energy at a given point in time and frequency should
be real and positive: $F_x(t,\omega) \ge 0$ for all $x$, $t$, and $\omega$. This seems appropriate
for interpreting the transform as an energy distribution. Some authors have argued
against the positivity requirement [e.g., Claasen & Mecklenbrauker 1980c]. We shall
examine the consequences of lifting this condition in the next section.

(C3) Superposition: The idea is that the time-frequency representation of a
multi-component signal should be a simple superposition of its components. The
straightforward linear formulation of this, i.e., $F_{x+cy}(t,\omega) = F_x(t,\omega) + cF_y(t,\omega)$,
however, is inconsistent with the quadratic nature of the transform and the
shift-invariance property C1. This apparent shortcoming is also true, for example,
of the spectrogram (Eq. 2.4.1). Nevertheless, we usually think of the conventional
spectrogram as being well-behaved under superposition. This is because
non-overlapping components do superimpose, i.e., $S_{x+y}(t,\omega) = S_x(t,\omega) + S_y(t,\omega)$
when $S_x(t,\omega)S_y(t,\omega) = 0$. There are no cross terms in this case. On the other hand,
the Wigner distribution does not have this property, suffering from cross terms to
which there cannot be attributed much physical significance.

† Some authors call this the ambiguity function [e.g., Claasen & Mecklenbrauker 1980a]; others reserve
this term for $|A_x(\tau,\nu)|^2$ [e.g., Van Trees 1968].

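We shall require this property for our time-frequency representation, namely

$$F_{x+y}(t,\omega) = F_x(t,\omega) + F_y(t,\omega) \quad\text{when}\quad F_x(t,\omega)F_y(t,\omega) = 0. \tag{2.5.6a}$$

More generally, we would like $F_{x+y}(t,\omega) \approx F_x(t,\omega) + F_y(t,\omega)$ when $F_x(t,\omega)F_y(t,\omega) \approx 0$.
Stated more precisely, we require that for any $\epsilon > 0$, there exists a $\delta > 0$ such that

$$\left|F_{x+y}(t,\omega) - [F_x(t,\omega) + F_y(t,\omega)]\right| < \epsilon \quad\text{when}\quad \left|F_x(t,\omega)F_y(t,\omega)\right| < \delta. \tag{2.5.6b}$$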


(C4) Locality: Signal energy that is localized in time-frequency should remain
localized in time-frequency in the transform. The advantage of the Wigner
distribution is that it is perfectly localized according to various criteria, such as preserving
the marginal distributions (Eq. 2.4.3) and the finite support properties [see Claasen
& Mecklenbrauker 1980a]. † The Wigner distribution, however, does not satisfy
the positivity (C2) or superposition (C3) properties, as indicated earlier. In fact,
positivity (and thus, as we shall see, superposition) is inconsistent with the time and
frequency marginal conditions [Claasen & Mecklenbrauker 1980c]. Fortunately, for
our purposes, we do not require perfect locality, so we can relax the above conditions
somewhat.

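† The finite support property states that if a signal has finite extent in time or frequency then its
representation will have the same extent in the corresponding variable.

From Eq. 2.5.2, the transform kernel $\phi(t,\omega)$ can be viewed as the point spread
function on the perfectly localized Wigner distribution. We can therefore measure
the locality of the transform in time and frequency in terms of the variances ‡

$$\sigma_t^2 = \frac{\iint t^2\,|\phi(t,\omega)|^2\,dt\,d\omega}{\iint |\phi(t,\omega)|^2\,dt\,d\omega} \tag{2.5.7a}$$

and

$$\sigma_\omega^2 = \frac{\iint \omega^2\,|\phi(t,\omega)|^2\,dt\,d\omega}{\iint |\phi(t,\omega)|^2\,dt\,d\omega}, \tag{2.5.7b}$$

where we assume that the center of mass of $|\phi(t,\omega)|^2$ is at the origin. *

In general, these two measures are not enough; an additional locality measure is
important, the covariance

$$\sigma_{t\omega} = \frac{\iint t\,\omega\,|\phi(t,\omega)|^2\,dt\,d\omega}{\iint |\phi(t,\omega)|^2\,dt\,d\omega}. \tag{2.5.7c}$$

Together, $\sigma_t$, $\sigma_\omega$, and $\sigma_{t\omega}$ determine the covariance matrix and the associated
concentration ellipse in the $(t,\omega)$ plane,

$$\begin{pmatrix} t & \omega \end{pmatrix}
\begin{pmatrix} \sigma_t^2 & \sigma_{t\omega} \\ \sigma_{t\omega} & \sigma_\omega^2 \end{pmatrix}^{-1}
\begin{pmatrix} t \\ \omega \end{pmatrix} = 1. \tag{2.5.8}$$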


When $\sigma_{t\omega} = 0$, the major and minor axes of the concentration ellipse coincide with
the time and frequency axes (Figure 2.9a). More generally, the concentration ellipse
may be oriented obliquely relative to the co-ordinate axes (Figure 2.9b). We shall
call transforms that satisfy the condition $\sigma_{t\omega} = 0$ on their kernel non-directionally
localized. This name is appropriate since we can rescale the co-ordinate axes to
make the concentration ellipse a circle under this condition. Thus viewed, the
transform spreads energy uniformly in all directions in time-frequency. On the other
hand, if $\sigma_{t\omega} \ne 0$, then this does not hold, and the transform will be directionally
localized, always having better resolution in some time-frequency directions than
others regardless of the scaling of the axes.

Figure 2.9. Concentration ellipses for transform kernels (axes: time, frequency).
(a) Non-directional kernel ($\sigma_{t\omega} = 0$): the co-ordinate axes can be re-scaled to make the
concentration ellipse a circle. Thus viewed, the corresponding transform spreads the signal
energy equally in all time-frequency directions. (b) Directional kernel ($\sigma_{t\omega} \ne 0$): the
co-ordinate axes cannot be re-scaled to make the concentration ellipse a circle. The
corresponding transform always has better resolution in some time-frequency directions than
others.

‡ The generality of this approach depends on the Wigner distribution uniquely satisfying 'perfect'
locality. Cohen has shown that a quadratic transform that satisfies the shift-invariance property
(C1) will meet the time and frequency marginal conditions (Eq. 2.4.3) if $\Phi(\tau,0) = 1$ for all $\tau$ and
$\Phi(0,\nu) = 1$ for all $\nu$. These marginal conditions essentially guarantee that an impulse and a complex
exponential are not 'blurred' by the time-frequency representation, but are not strong enough to
also guarantee that a linear chirp is not 'blurred' (see Figure 2.7a). This additional condition is
met uniquely by the Wigner distribution. In other words, we interpret perfect locality to mean that
the signal transform does not spread the signal energy in any direction in time-frequency (not just
the horizontal and vertical directions). We postpone a more thorough discussion of this point until
Section 2.8, when the necessary mathematical machinery will be introduced.

* This assumption is not very restrictive on the form of the transform, since we can always shift $\phi(t,\omega)$
in time and frequency to satisfy it. This shift, in turn, shifts the transform in time and frequency.

The analysis of the non-directional transforms is more straightforward. We therefore
restrict our attention to this case until Section 2.8, when we shall examine the
more general case. We will see there that the principal results are essentially the
same as in the non-directional case, suitably generalized. The analysis, however, is more
complex, and is thus best left until later.

To summarize, given a non-directional transform ($\sigma_{t\omega} = 0$), $\sigma_t$ and $\sigma_\omega$ measure its
degree of locality in time and frequency. The smaller $\sigma_t$ and $\sigma_\omega$ are, the better the
time and frequency resolution.
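
The locality measures of Eq. 2.5.7 can be evaluated on a sampled kernel. The sketch below is only illustrative: the function `kernel_moments`, the grid, and the two example kernels are assumptions, chosen so that the expected moments are easy to verify by hand.

```python
import numpy as np

def kernel_moments(phi, t, w):
    """Return (var_t, var_w, cov_tw) of |phi|^2 sampled on the grid t x w."""
    T, W = np.meshgrid(t, w, indexing="ij")
    p = np.abs(phi(T, W)) ** 2
    p = p / p.sum()                        # normalise so the weights sum to 1
    mt, mw = (T * p).sum(), (W * p).sum()  # centre of mass
    var_t = ((T - mt) ** 2 * p).sum()
    var_w = ((W - mw) ** 2 * p).sum()
    cov = ((T - mt) * (W - mw) * p).sum()
    return var_t, var_w, cov

grid = np.linspace(-8.0, 8.0, 321)

# Hypothetical kernels chosen only for illustration
plain = lambda t, w: np.exp(-(t**2 / 2 + w**2 / 8))   # axes-aligned gaussian
tilted = lambda t, w: np.exp(-(t**2 + t * w + w**2))  # obliquely oriented gaussian

vt, vw, c = kernel_moments(plain, grid, grid)
vt2, vw2, c2 = kernel_moments(tilted, grid, grid)
```

The first kernel has zero covariance (non-directional in the sense above); the second has a non-zero covariance, so its concentration ellipse is tilted relative to the axes.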


(C5)  Smoothness:  Similar  to  the  stationary  case,  different  aspects  of  the  speech 
signal  can  arise  at  different  scales  in  time-frequency.  For  example,  voiced  excitation 
can  give  rise  to  fine  scale  structure  on  the  order  of  the  pitch  period  in  the  time 
dimension  and  the  fundamental  frequency  in  the  frequency  dimension.  The  formant 
structure,  on  the  other  hand,  arises  at  a  somewhat  larger  scale.  Thus,  one  of 
the  design  parameters  for  our  transform  is  the  scale  in  time-frequency  we  wish  to 
examine.  Said  differently,  we  want  the  transform  to  be  smooth  in  time-frequency 
to  a  given  degree. 


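This notion of scale can be formalized by measuring the distribution of the spatial
frequencies present in $F_x(t,\omega)$, i.e., the distribution of energy about the origin of its
2-D fourier transform. Since $\mathcal{F}[F_x(t,\omega)] = \Phi(\tau,\nu)A_x(\tau,\nu)$ (Eq. 2.5.3), the relative
amount of spread is determined by the choice of $\Phi(\tau,\nu)$, which windows the
time-frequency autocorrelation function. We can measure this spread in terms of the
variances

$$\Sigma_\tau^2 = \frac{\iint \tau^2\,|\Phi(\tau,\nu)|^2\,d\tau\,d\nu}{\iint |\Phi(\tau,\nu)|^2\,d\tau\,d\nu}, \tag{2.5.9a}$$

$$\Sigma_\nu^2 = \frac{\iint \nu^2\,|\Phi(\tau,\nu)|^2\,d\tau\,d\nu}{\iint |\Phi(\tau,\nu)|^2\,d\tau\,d\nu}, \tag{2.5.9b}$$

and

$$\Sigma_{\tau\nu} = \frac{\iint \tau\,\nu\,|\Phi(\tau,\nu)|^2\,d\tau\,d\nu}{\iint |\Phi(\tau,\nu)|^2\,d\tau\,d\nu}, \tag{2.5.9c}$$

where we assume that the center of mass of $|\Phi(\tau,\nu)|^2$ is at the origin. † These
determine the covariance matrix and the associated concentration ellipse in the
$(\tau,\nu)$ plane,

$$\begin{pmatrix} \tau & \nu \end{pmatrix}
\begin{pmatrix} \Sigma_\tau^2 & \Sigma_{\tau\nu} \\ \Sigma_{\tau\nu} & \Sigma_\nu^2 \end{pmatrix}^{-1}
\begin{pmatrix} \tau \\ \nu \end{pmatrix} = 1. \tag{2.5.10}$$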


When $\Sigma_{\tau\nu} = 0$, we call the transform non-directionally smooth. In this case, it is
possible to rescale the co-ordinate axes to make the concentration ellipse a circle,
and thus viewed the transform smoothes the signal uniformly in
all directions in time-frequency. On the other hand, if $\Sigma_{\tau\nu} \ne 0$, then this does not
hold, and the transform will be directionally smooth, always smoothing more in
some time-frequency directions than others regardless of the scaling of the axes.
As with the locality condition, we will restrict attention for now to the non-directional
transforms. We consider the more general case in Section 2.8.

To summarize, given a non-directional transform ($\Sigma_{\tau\nu} = 0$), $\Sigma_\tau$ and $\Sigma_\nu$ measure its
scale in time and frequency. The smaller $\Sigma_\tau$ and $\Sigma_\nu$ are, the larger the selected
scales.


Observe at this point the parallels between the stationary and non-stationary
analyses. If we think of the Wigner distribution as the non-stationary analog to the raw
power spectrum, then the time-frequency autocorrelation function (the Wigner
distribution's 2-D fourier transform) is the 2-D analog to the autocorrelation function
(the power spectrum's fourier transform). Further, windowing the time-frequency
autocorrelation function smoothes the Wigner distribution, just as windowing the
autocorrelation smoothes the raw spectrum. In both cases, the design decisions
for the resulting transform require selecting a convolution kernel that satisfies both
locality and smoothness requirements. In fact, we shall see in the next chapter that
the analogy is even closer.

† This assumption will be true if the transform is real.

2.6.  Relations  among  the  design  criteria 


The various design criteria for our time-frequency energy representation are not
independent. We shall state the important relationships among them in this section.

Throughout this section we assume that the input signal $x(t)$ is finite energy (i.e.,
$x \in L_2$) and that $F_x(t,\omega)$ is a quadratic transform of the signal. This means that
$F_x(t,\omega) = \langle T_{t,\omega}x, x\rangle$, where $\langle x, y\rangle = \int_{-\infty}^{\infty} x(\alpha)\,y^*(\alpha)\,d\alpha$ and $T_{t,\omega}$ is a (bounded) linear
operator on $L_2$.


• Shift-invariance & Positivity: Together these imply that the transform can
be expressed as a superposition of spectrograms. †

Theorem A. Let $F_x(t,\omega)$ be positive and shift-invariant. Then it has the form

$$F_x(t,\omega) = \int_{-\infty}^{\infty} S_x(t,\omega;g_\alpha)\,d\alpha, \tag{2.6.1}$$

where $S_x(t,\omega;g)$ is the spectrogram having $g$ as its window.

Proof: The positivity of $F_x(t,\omega)$ means that $T_{t,\omega}$ is a positive operator and therefore
has a square root $A$, i.e.,

$$F_x = \langle A^*Ax, x\rangle = \langle Ax, Ax\rangle = \|Ax\|^2, \tag{A.1}$$

where $\|x(\alpha)\|^2 = \int |x(\alpha)|^2\,d\alpha$ [see Rudin 1973]. Representing the linear operator $A$
in terms of its impulse response, $A_{t,\omega}[x(\alpha)] = \int h(\alpha,\tau;t,\omega)\,x(\tau)\,d\tau$, and substituting
into Eq. A.1 gives

$$F_x(t,\omega) = \int_{-\infty}^{\infty}\left|\int_{-\infty}^{\infty} h(\alpha,\tau;t,\omega)\,x(\tau)\,d\tau\right|^2 d\alpha. \tag{A.2}$$

† Bouachache, et al. [1979] incorrectly state that a positive and shift-invariant quadratic transform is
necessarily a spectrogram. Claasen & Mecklenbrauker [1984] point out this error, mentioning that
linear combinations of spectrograms must be included.


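By time and frequency shift-invariance,

$$F_x(t+a,\omega+\varphi) = \int_{-\infty}^{\infty}\left|\int_{-\infty}^{\infty} h(\alpha,\tau;t,\omega)\,x(\tau+a)\,e^{-i\varphi\tau}\,d\tau\right|^2 d\alpha.$$

Setting $t = \omega = 0$ gives

$$F_x(a,\varphi) = \int_{-\infty}^{\infty}\left|\int_{-\infty}^{\infty} h(\alpha,\tau;0,0)\,x(\tau+a)\,e^{-i\varphi\tau}\,d\tau\right|^2 d\alpha,$$

or, with $g_\alpha(\tau) = h(\alpha,\tau;0,0)$,

$$F_x(a,\varphi) = \int_{-\infty}^{\infty}\left|\int_{-\infty}^{\infty} g_\alpha(\tau)\,x(\tau+a)\,e^{-i\varphi\tau}\,d\tau\right|^2 d\alpha.$$

From Eq. 2.4.1, we see the outer integrand is the spectrogram $S_x(a,\varphi;g_\alpha)$; renaming
$(a,\varphi)$ as $(t,\omega)$ gives Eq. 2.6.1. ///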


• Positivity & Superposition: The next theorem shows that positivity implies
superposition. In fact, it implies a strong form of superposition, as in Eq. 2.5.6b.

Theorem B. If $F_x(t,\omega)$ is positive, then

$$\left|F_{x+y}(t,\omega) - [F_x(t,\omega) + F_y(t,\omega)]\right|^2 \le 4\,F_x(t,\omega)\,F_y(t,\omega). \tag{2.6.2}$$

Proof: From the elementary fact about inner products

$$\|p+q\|^2 = \|p\|^2 + 2\,\mathrm{Re}\,\langle p,q\rangle + \|q\|^2$$

it follows that

$$\left|\,\|p+q\|^2 - \left[\|p\|^2 + \|q\|^2\right]\right|^2 = 4\,|\mathrm{Re}\,\langle p,q\rangle|^2 \le 4\,|\langle p,q\rangle|^2.$$

Since $|\langle p,q\rangle| \le \|p\|\,\|q\|$,

$$\left|\,\|p+q\|^2 - \left[\|p\|^2 + \|q\|^2\right]\right|^2 \le 4\,\|p\|^2\,\|q\|^2.$$

Substituting $p = Ax$ and $q = Ay$ above and using Eq. A.1 gives Eq. 2.6.2. ///
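
The bound of Theorem B can be observed numerically for a spectrogram, which is one positive transform. This is a sketch under stated assumptions: a circular discrete spectrogram, a gaussian window, and two hypothetical well-separated signal components. The small tolerance terms only absorb floating-point rounding.

```python
import numpy as np

def spectrogram(x, g):
    """Circular spectrogram: squared magnitude of the windowed DFT at each time."""
    N = len(x)
    return np.array([np.abs(np.fft.fft(x * np.roll(g, t))) ** 2 for t in range(N)])

N = 64
n = np.arange(N)
g = np.exp(-0.5 * (np.minimum(n, N - n) / 3.0) ** 2)   # window centred on index 0

# Two components, separated in both time and frequency (hypothetical values)
x = np.exp(-0.5 * ((n - 16) / 4.0) ** 2) * np.exp(2j * np.pi * 6 * n / N)
y = np.exp(-0.5 * ((n - 44) / 4.0) ** 2) * np.exp(2j * np.pi * 20 * n / N)

Fx, Fy, Fxy = spectrogram(x, g), spectrogram(y, g), spectrogram(x + y, g)
cross = Fxy - (Fx + Fy)                                 # deviation from superposition

# Eq. 2.6.2 at every (t, w) sample, with a tiny slack for rounding error
bound_ok = np.all(cross ** 2 <= 4 * Fx * Fy * (1 + 1e-9) + 1e-12)
```

Because the two components barely overlap, the product `Fx * Fy` is small everywhere, and so is `cross`, which is the approximate superposition of Eq. 2.5.6b.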

If the transform is real, the converse of this theorem is also true; i.e., superposition
implies either $F_x$ or $-F_x$ is positive.

Theorem C. Let $F_x(t,\omega)$ be real and satisfy superposition (Eq. 2.5.6a). Then
either $F_x(t,\omega)$ or $-F_x(t,\omega)$ is positive.

Proof: Step 1. First we show that, under the hypotheses of the theorem, $\langle Tx,x\rangle = 0 \Rightarrow Tx = 0$.

Superposition says

$$\langle Tx,x\rangle\langle Ty,y\rangle = 0 \;\Rightarrow\; \langle T(x+y),x+y\rangle = \langle Tx,x\rangle + \langle Ty,y\rangle. \tag{C.1}$$

Since the form $\langle Tx,x\rangle$ is always real, $\langle Tx,y\rangle = \langle Ty,x\rangle^*$, so

$$\langle T(x+y),x+y\rangle = \langle Tx,x\rangle + 2\,\mathrm{Re}\,\langle Tx,y\rangle + \langle Ty,y\rangle.$$

Thus, from Eq. C.1,

$$\langle Tx,x\rangle\langle Ty,y\rangle = 0 \;\Rightarrow\; \mathrm{Re}\,\langle Tx,y\rangle = 0. \tag{C.2}$$

Substituting $ix$ for $x$ in Eq. C.2 shows that $\mathrm{Im}\,\langle Tx,y\rangle = 0$ also, so that

$$\langle Tx,x\rangle\langle Ty,y\rangle = 0 \;\Rightarrow\; \langle Tx,y\rangle = 0. \tag{C.3}$$

Suppose that $\langle Tx,x\rangle = 0$. Then by Eq. C.3, $\langle Tx,y\rangle = 0$ for all $y$. If we let $y = Tx$,
then $\langle Tx,Tx\rangle = 0$ and thus $Tx = 0$, as desired.

Step 2. We now show that $\langle Tz,z\rangle = 0 \Rightarrow Tz = 0$ implies $\pm T$ is positive. Suppose
$\langle Tx,x\rangle > 0$ and $\langle Ty,y\rangle < 0$. Let $z = kx + y$ where $k$ is real. Then

$$\langle Tz,z\rangle = k^2\langle Tx,x\rangle + 2k\,\mathrm{Re}\,\langle Tx,y\rangle + \langle Ty,y\rangle.$$

This is a quadratic in $k$, and since $\langle Tx,x\rangle\langle Ty,y\rangle < 0$, it has two distinct real zeroes.
However, since $Tx \ne 0$, $Tz = kTx + Ty$ has at most one zero in $k$. Therefore, there
exists a value of $k$ such that $\langle Tz,z\rangle = 0$ but $Tz \ne 0$, contradicting the hypothesis,
and implying $\pm T$ is positive. ///

This  last  theorem  shows  that  we  can  replace  the  positivity  condition  (C2)  with  the 
sole  requirement  that  the  transform  be  always  real,  and  have  an  equivalent  set  of 
properties.  In  other  words,  the  transform  will  necessarily  be  positive  if  superposition 
holds,  and  if  positivity  is  abandoned,  cross  terms  will  necessarily  prove  a  problem 
for  multi-component  signals  such  as  speech. 


• Positivity & Locality: The positivity condition places a limit on the
time-frequency locality of the transform. When the transform is positive, it is
sometimes convenient to measure locality in terms of the variances of $\phi(t,\omega)$ instead of
$|\phi(t,\omega)|^2$. We define

$$\sigma_T^2 = \frac{\iint t^2\,\phi(t,\omega)\,dt\,d\omega}{\iint \phi(t,\omega)\,dt\,d\omega} \tag{2.6.3a}$$

and

$$\sigma_\Omega^2 = \frac{\iint \omega^2\,\phi(t,\omega)\,dt\,d\omega}{\iint \phi(t,\omega)\,dt\,d\omega}, \tag{2.6.3b}$$

where we assume that the center of mass of $\phi(t,\omega)$ is at the origin. † When the
transform is positive, we claim that these variances are non-negative. To show this,
first suppose the transform is a spectrogram. Then $\phi(t,\omega)$ is the Wigner distribution
of the spectrogram window $g(t)$, and using Eq. 2.4.3, it is easy to see that

$$\sigma_T^2 = \mathrm{var}\,|g(t)|^2 \quad\text{and}\quad \sigma_\Omega^2 = \mathrm{var}\,|G(\omega)|^2, \tag{2.6.4}$$

which are clearly non-negative [cf. De Bruijn]. More generally, if the transform is
positive, it follows directly from Theorem A that

$$\sigma_T^2 = \int c_\alpha\,\mathrm{var}\,|g_\alpha(t)|^2\,d\alpha \quad\text{and}\quad \sigma_\Omega^2 = \int c_\alpha\,\mathrm{var}\,|G_\alpha(\omega)|^2\,d\alpha, \tag{2.6.5}$$

where

$$c_\alpha = \frac{\int_{-\infty}^{\infty} |g_\alpha(t)|^2\,dt}{\iint |g_{\alpha'}(t)|^2\,dt\,d\alpha'}. \tag{2.6.6}$$

These are again non-negative quantities.


Eq. 2.6.5 shows that $\sigma_T^2$ is the (weighted) average window variance in the
representation of $F_x(t,\omega)$ as a superposition of spectrograms. Since a spectrogram's values
at a given time depend only on signal values under its window, we see that a positive
transform at a time $t$ effectively depends only on signal values within a few $\sigma_T$ of $t$. *

† This assumption is necessary for the term 'variance' to apply. It is not necessary, however, for the
uncertainty relations presented below to be true [cf. De Bruijn].

* This is a stronger notion of time locality than in the previous section. There, time locality essentially
measured how the transform spread an impulse. The Wigner distribution is perfectly localized in
this sense, because it represents the energy of an impulse at time $t_0$ entirely on the vertical line
$t = t_0$ in the time-frequency plane. This does not mean that the Wigner distribution's values at
time $t_0$ depend only on the signal value at $t_0$. Quite the opposite is true: they depend on the entire
signal. (In fact, the signal can be recovered from the Wigner distribution's values at any fixed time
$t_0$ (up to a multiplicative constant) [see Claasen & Mecklenbrauker 1980a].) However, when the
transform is positive these two notions of locality coincide.


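The next theorem states an important uncertainty relation for positive transforms.
It bounds the simultaneous time and frequency resolution that can be obtained by
such a transform.

Theorem D. Let $F_x(t,\omega)$ be positive and shift-invariant. Then

$$\sigma_T\,\sigma_\Omega \ge \tfrac12.$$

Proof: From Eq. 2.6.5,

$$\sigma_T^2\,\sigma_\Omega^2 = \int c_\alpha\,\sigma_\alpha^2\,d\alpha \int c_\alpha\,\Sigma_\alpha^2\,d\alpha,$$

where $\sigma_\alpha^2 = \mathrm{var}\,|g_\alpha(t)|^2$ and $\Sigma_\alpha^2 = \mathrm{var}\,|G_\alpha(\omega)|^2$. By the Schwarz Inequality,

$$\sigma_T^2\,\sigma_\Omega^2 \ge \left(\int c_\alpha\,\sigma_\alpha\,\Sigma_\alpha\,d\alpha\right)^2.$$

The classical uncertainty relation applied to $g_\alpha(t)$ gives $\sigma_\alpha\Sigma_\alpha \ge \tfrac12$, so

$$\sigma_T^2\,\sigma_\Omega^2 \ge \left(\tfrac12\int c_\alpha\,d\alpha\right)^2 = \tfrac14,$$

since $\int c_\alpha\,d\alpha = 1$ from Eq. 2.6.6. Taking square roots yields the desired result. ///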
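
For a single spectrogram, Eq. 2.6.4 reduces Theorem D to the classical uncertainty relation for the window. The sketch below estimates the two spreads numerically with an FFT; the function `spreads`, the grid, and the two example windows are assumptions made for illustration.

```python
import numpy as np

def spreads(g, t):
    """Standard deviations of |g(t)|^2 and |G(w)|^2 (G = Fourier transform of g)."""
    dt = t[1] - t[0]
    w = 2 * np.pi * np.fft.fftfreq(len(t), d=dt)   # angular frequency grid
    pg = np.abs(g) ** 2
    pG = np.abs(np.fft.fft(g)) ** 2                # constant factors cancel below
    mean_t = (t * pg).sum() / pg.sum()
    mean_w = (w * pG).sum() / pG.sum()
    var_t = ((t - mean_t) ** 2 * pg).sum() / pg.sum()
    var_w = ((w - mean_w) ** 2 * pG).sum() / pG.sum()
    return np.sqrt(var_t), np.sqrt(var_w)

t = np.linspace(-20.0, 20.0, 4096, endpoint=False)
gauss = np.exp(-t**2 / (2 * 1.5**2))               # gaussian window
hermite1 = t * np.exp(-t**2 / 2)                   # a non-gaussian window

st, sw = spreads(gauss, t)
st1, sw1 = spreads(hermite1, t)
```

The gaussian window sits on the bound (product of spreads equal to 1/2), while the non-gaussian window gives a strictly larger product.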


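• Locality & Smoothness: Just as in the stationary case, locality and
smoothness are conflicting properties. Greater smoothness means poorer locality and vice
versa, other things being equal. This follows formally from a two-dimensional
generalization of the classical uncertainty relation.

Theorem E. If $F_x(t,\omega)$ is shift-invariant, then $\sigma_t\Sigma_\nu \ge \tfrac12$ and $\sigma_\omega\Sigma_\tau \ge \tfrac12$, with
equality in both these relations iff

$$\phi(t,\omega) \propto e^{-(a t^2 + b\,\omega^2)} \quad\text{for some constants } a, b > 0. \tag{2.6.7}$$

Proof: First, we show that $\sigma_t\Sigma_\nu \ge \tfrac12$. Let $A(t,\tau) = \frac{1}{2\pi}\int \phi(t,\omega)\,e^{i\omega\tau}\,d\omega$. Then
$\Phi(\tau,\nu) = \mathcal{F}[\phi(t,\omega)] = \int A(t,\tau)\,e^{-i\nu t}\,dt$. Applying the classical uncertainty relation
to $A(t,\tau)$ w.r.t. $t$ gives

$$\sqrt{\tfrac{\pi}{2}}\int |A(t,\tau)|^2\,dt \le \left(\int t^2|A(t,\tau)|^2\,dt\right)^{1/2}\left(\int \nu^2|\Phi(\tau,\nu)|^2\,d\nu\right)^{1/2}. \tag{E.1}$$

Integrating E.1 over $\tau$ and using the Schwarz Inequality,

$$\sqrt{\tfrac{\pi}{2}}\iint |A(t,\tau)|^2\,dt\,d\tau
\le \int\left(\int t^2|A(t,\tau)|^2\,dt\right)^{1/2}\left(\int \nu^2|\Phi(\tau,\nu)|^2\,d\nu\right)^{1/2} d\tau
\le \left(\iint t^2|A(t,\tau)|^2\,dt\,d\tau \iint \nu^2|\Phi(\tau,\nu)|^2\,d\nu\,d\tau\right)^{1/2}. \tag{E.2}$$

By Parseval's theorem,

$$\int |A(t,\tau)|^2\,d\tau = \frac{1}{2\pi}\int |\phi(t,\omega)|^2\,d\omega \tag{E.3a}$$

and

$$\int |A(t,\tau)|^2\,dt = \frac{1}{2\pi}\int |\Phi(\tau,\nu)|^2\,d\nu. \tag{E.3b}$$

Substituting Eq. E.3 into Eq. E.2 yields

$$\tfrac12\iint |\phi(t,\omega)|^2\,dt\,d\omega \le \left(\iint t^2|\phi(t,\omega)|^2\,dt\,d\omega \iint \nu^2|\Phi(\tau,\nu)|^2\,d\nu\,d\tau\right)^{1/2}. \tag{E.4}$$

Since $\iint |\phi(t,\omega)|^2\,dt\,d\omega = \iint |\Phi(\tau,\nu)|^2\,d\tau\,d\nu$, we have $\tfrac12 \le \sigma_t\Sigma_\nu$. By similar
reasoning, $\tfrac12 \le \sigma_\omega\Sigma_\tau$.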



Direct computation of the variances shows that if $\phi(t,\omega)$ is a 2-D gaussian (Eq. 2.6.7),
then these inequalities are satisfied with equality. Showing the converse is
somewhat more involved. If these inequalities are satisfied with equality, then from the
classical uncertainty relation and the proof above, it follows that $\Phi(\tau,\nu)$ is Gaussian
in each of its variables. In other words,

$$\Phi(\tau,\nu) = e^{-[a(\nu)\tau^2 + b(\nu)]} = e^{-[c(\tau)\nu^2 + d(\tau)]} \tag{E.5}$$

for all $\tau$ and $\nu$, where $a > 0$ and $c > 0$. Thus, $a(\nu)\tau^2 + b(\nu) = c(\tau)\nu^2 + d(\tau)$.
Setting $\tau = 0$ and $\nu = 0$ shows that $b(\nu) = c(0)\nu^2 + d(0)$ and $d(\tau) = a(0)\tau^2 + b(0)$,
respectively, so

$$a(\nu)\tau^2 + c(0)\nu^2 + d(0) = c(\tau)\nu^2 + a(0)\tau^2 + b(0). \tag{E.6}$$

Twice differentiating this w.r.t. $\tau$ and $\nu$ gives $a''(\nu) = c''(\tau)$ for all $\tau$ and $\nu$; thus
they are constant. Taylor expanding $a(\nu)$ and $c(\tau)$, substituting into Eq. E.6, and
equating terms shows that

$$\Phi(\tau,\nu) = e^{-[a(0)\tau^2 + c(0)\nu^2 + \frac12 a''(0)\tau^2\nu^2 + b(0)]}. \tag{E.7}$$


By  the  symmetry  of  the  two  domains,  <^((,w)  must  have  the  same  form.  Together, 
these  imply  that 


for  all  t  and  r.  Taking  the  logarithm  of  Eq.  E.8,  clearing  of  fractions,  and  equating 
terms  shows  that  Ti  =  02  =  0.  Thus,  o”(0)  =  0  in  Eq.  E.7,  which  implies  Eq.  2.6.7, 
as desired. ///

2.7. Satisfying the design criteria: the Gaussian transform

From the last theorem, we see that a two-dimensional gaussian transform kernel
gives the best time-frequency locality for a given smoothness. The resulting
representation will be called the Gaussian transform of the signal. † By specifying $\sigma_T^2$
$(= 2\sigma_t^2)$ and $\sigma_\Omega^2$ $(= 2\sigma_\omega^2)$ for this kernel we are, in effect, selecting a particular time
and frequency scale for the transform. We may choose any values we wish provided
$\sigma_T\sigma_\Omega \ge \tfrac12$ (positivity), and the resulting transform will best satisfy all our design
properties. The result is clearly a generalization of the solution in the stationary
case, where a gaussian convolution kernel of different sizes selected different spectral
scales.

When $\sigma_T\sigma_\Omega = \tfrac12$, this transform is equivalent to a spectrogram using a gaussian
window. For larger values of $\sigma_T\sigma_\Omega$, this transform is equivalent to convolving such
a spectrogram with a 2-D gaussian.

As a note on its implementation, this last fact was used to compute the figures
below. A more direct method would be to compute the Wigner distribution and
then perform the 2-D convolution specified in Eq. 2.5.2. This is not very efficient in a
digital implementation, however, since the Wigner distribution has to be computed
at high sampling rates to avoid aliasing. *

By performing a convolution on a spectrogram, far fewer time and frequency samples
need to be computed, since the spectrogram is already a smoothed version of the
Wigner distribution. Further, since the gaussian kernel is uncorrelated in time
and frequency, the 2-D convolution is separable, and can be performed as separate
1-D convolutions in the time and frequency directions, resulting in a relatively
inexpensive computation.

† We have chosen this name for obvious reasons. This risks, however, confusion with the
Gauss-Weierstrass transformation [see Hille 1948]. In fact, the Gaussian transform of the signal $x(t)$ is the
two-dimensional Gauss-Weierstrass transformation of the Wigner distribution $W_x(t,\omega)$ [see De Bruijn
1967].

* In general, the Wigner distribution must be sampled in time at twice the Nyquist rate of the signal
[Claasen & Mecklenbrauker 1980b].
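
The separability argument can be sketched as follows. This is not the implementation used for the figures; it is a minimal illustration, assuming zero padding at the borders, with hypothetical tap widths and a random array standing in for a spectrogram. It compares two 1-D passes against an explicit 2-D convolution with the outer-product kernel.

```python
import numpy as np

def gaussian_taps(sigma, radius):
    """Normalised, odd-length gaussian filter taps."""
    k = np.arange(-radius, radius + 1)
    h = np.exp(-0.5 * (k / sigma) ** 2)
    return h / h.sum()

def smooth_separable(S, h_t, h_w):
    """Convolve rows with h_w, then columns with h_t (zero padding, same size)."""
    tmp = np.apply_along_axis(lambda r: np.convolve(r, h_w, mode="same"), 1, S)
    return np.apply_along_axis(lambda c: np.convolve(c, h_t, mode="same"), 0, tmp)

def smooth_direct(S, h_t, h_w):
    """Reference: explicit 2-D convolution with the outer-product kernel."""
    K = np.outer(h_t, h_w)
    rt, rw = len(h_t) // 2, len(h_w) // 2
    P = np.pad(S, ((rt, rt), (rw, rw)))           # zero padding on all sides
    out = np.zeros_like(S, dtype=float)
    for i in range(S.shape[0]):
        for j in range(S.shape[1]):
            patch = P[i:i + 2 * rt + 1, j:j + 2 * rw + 1]
            out[i, j] = np.sum(patch * K[::-1, ::-1])
    return out

rng = np.random.default_rng(3)
S = rng.random((20, 24))             # stand-in for a spectrogram (time x frequency)
h_t, h_w = gaussian_taps(1.5, 4), gaussian_taps(2.5, 6)
A = smooth_separable(S, h_t, h_w)
B = smooth_direct(S, h_t, h_w)
```

The two results agree, while the separable version costs on the order of the sum of the tap counts per output sample rather than their product.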

2.8.  Directional  time-frequency  transforms 

So far, we have assumed that the time-frequency energy representation was
non-directional in the sense that the covariances $\sigma_{t\omega}$ and $\Sigma_{\tau\nu}$ of the transform kernel
were both zero. We shall now examine the consequences of lifting this condition.
We  begin  with  an  example.  Consider  the  two  transforms  specified  by  the  kernels 

and 

These transforms have identical $\sigma_t$ and $\sigma_\omega$, but differ in the sign of $\sigma_{t\omega}$. Figure
2.10  shows  their  concentration  ellipses,  and  Figure  2.11  gives  the  transform  of  the 
chirp  e’f*  for  these  two  cases.  Notice  that  the  second  transform  broadens  the  chirp 
much  more  than  the  first,  which  should  be  evident  from  the  concentration  ellipses. 
The  opposite  would  be  true  for  the  chirp  These  transforms  are  directionally 

sensitive, and using $\sigma_t$ and $\sigma_\omega$ as the sole measures of time-frequency resolution is
obviously inadequate in such cases.

Why consider transforms with such behavior? One answer is to provide a general
treatment of time-frequency locality. Another answer is that it is evidently
possible to obtain better time-frequency resolution for some signals if the transform is
directionally 'tuned' to them than otherwise. This would mean that, in general,
we would need a family of transforms each tuned to a preferred time-frequency
orientation.

Figure 2.10. Concentration ellipses for transform kernels with complementary
orientation selectivity. (a) Concentration ellipse for $\phi_1(t,\omega)$.
(b) Concentration ellipse for $\phi_2(t,\omega)$.


The theory of directional transforms is greatly simplified by a rotation of
co-ordinates. Let

$$R_\theta = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix} \tag{2.8.1}$$

be the operator that rotates a point $\theta$ radians in the time-frequency plane. Given a
time-frequency representation $F_x(t,\omega)$ of a signal $x(t)$, we can consider the rotated
representation formed by the composition $F_xR_\theta(t,\omega)$. Is this the time-frequency
representation of an actual signal? The answer is yes; if

00 

io(t)  = - ^  — e'^  /  X(w)e‘^  5  dui,  (2.8.2) 

2itVcos6  J 

-00 

then $W_{x_\theta} = W_xR_\theta$ [see Van Trees 1971]. So if $F_x$ has the kernel $\phi(t,\omega)$ and if $G_x$
has the kernel $\phi R_\theta(t,\omega)$, then $G_{x_\theta} = F_xR_\theta$. In other words, Eq. 2.8.2 rotates the
signal by $\theta$ radians in time-frequency; thus the transform with the rotated kernel
applied to this signal will give the desired effect.

Relative to these new co-ordinates we can generalize some of the measures of the
previous sections. For example, consider

$$\pi_\theta(t) = \frac{1}{2\pi}\int_{-\infty}^{\infty} F_xR_\theta(t,\omega)\,d\omega. \tag{2.8.3}$$

This is the marginal of the rotated transform along $\omega$. It follows that the time and
frequency marginals (Eq. 2.4.2) of $F_x(t,\omega)$ satisfy $\pi_1 = \pi_{\theta=0}$ and $\pi_2 = 2\pi\,\pi_{\theta=\pi/2}$.

If $\pi_\theta(t) = |x_\theta(t)|^2$, then we will say that the transform preserves the marginal
relative to the direction $\theta$ in time-frequency. Interestingly, the Wigner distribution
uniquely meets this requirement for all $\theta$. The proof is a simple generalization of
Cohen's result. He showed that a shift-invariant quadratic transform preserves the
time marginal, i.e., $\pi_1(t) = |x(t)|^2$, iff $\Phi(\tau,0) = 1$ for all $\tau$. Using $\mathcal{F}[\phi R_\theta] = \Phi R_\theta$,
which is easily verified, it follows that $\pi_\theta(t) = |x_\theta(t)|^2$ iff $\Phi R_\theta(\tau,0) = 1$ for all $\tau$. This
implies that $\Phi(\tau,\nu) = 1$, which corresponds to the Wigner distribution by Eq. 2.5.3.
This is the reason for considering the Wigner distribution 'perfectly localized' and
$\phi(t,\omega)$ the 'point spread function' in time-frequency.


The  amount  of  spread  in  time-frequency  direction  9  can  be  measured  by  the  variance 


/  /  t^\^Rg^{t,u})\^  dt  du) 


2  ^  “OO  — o 
00  oo 


f  f  l<i>R0'{t,u!)\^  dtdoj 

—  OO  —00 

In  the  notation  of  the  previous  sections,  ctj  =  £Ts=o,  ^-^d 


2  I  a  ■  a  \  t  Oi^  \  ( COS  0 

<ji  =  (cos0  sin  9)  I  '  2  M  n 

*  ^  '  \citu  (tiy\5Jn5 


(2.8.4) 


(2.8.5) 


Let $\sigma_1^2$ be the maximum value and $\sigma_2^2$ be the minimum value of $\sigma_\theta^2$, which correspond to the eigenvalues of the covariance matrix in Eq. 2.8.5. Further, let $\theta'$ be the maximum direction, i.e., $(\cos\theta' \;\; \sin\theta')^T$ is the eigenvector of the eigenvalue $\sigma_1^2$. In other words, $\sigma_1$ and $\sigma_2$ are the maximum and minimum dimensions of the concentration ellipse of $\phi(t,\omega)$, and $\theta'$ is the angle of the major axis of the concentration ellipse relative to the time axis. These three quantities conveniently specify the time-frequency locality of the transform.
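The eigenvalue description above can be illustrated with a small numerical sketch. It assumes the covariance entries of Eq. 2.8.5 are already expressed in normalized (dimensionless) time and frequency units, so that rotating between the two axes is meaningful; the function names and the sample matrix are hypothetical, chosen only for illustration.

```python
import math

def directional_variance(theta, var_t, cov_tw, var_w):
    """Quadratic form of Eq. 2.8.5: variance along the direction theta."""
    c, s = math.cos(theta), math.sin(theta)
    return c * c * var_t + 2.0 * c * s * cov_tw + s * s * var_w

def principal_axes(var_t, cov_tw, var_w):
    """Return (largest variance, smallest variance, angle of the major axis)
    for the symmetric 2x2 covariance matrix [[var_t, cov_tw], [cov_tw, var_w]]."""
    mean = 0.5 * (var_t + var_w)
    radius = math.hypot(0.5 * (var_t - var_w), cov_tw)
    theta_major = 0.5 * math.atan2(2.0 * cov_tw, var_t - var_w)
    return mean + radius, mean - radius, theta_major

# Hypothetical kernel whose time and frequency spreads are positively correlated.
var_max, var_min, theta_major = principal_axes(3.0, 1.0, 3.0)
```

For this sample matrix the major axis lies at 45 degrees to the time axis, and evaluating `directional_variance` at `theta_major` returns the larger eigenvalue.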


In  an  analogous  manner,  we  can  measure  the  smoothness  of  the  transform  in  time- 
frequency  direction  6  by 

y,2  _  ~00  -OO _ 

^9  ”  OO  00 
-OO  — OO 


(2.8.6)

In the notation of the previous sections, $\Sigma_t = \Sigma_{\theta=0}$ and $\Sigma_\omega = \Sigma_{\theta=\pi/2}$.

Let $\Sigma_1^2$ be the maximum value and $\Sigma_2^2$ be the minimum value of $\Sigma_\theta^2$, and let $\theta''$ be the maximum direction. These three quantities conveniently specify the time-frequency smoothness of the transform.

We  are  now  in  a  position  to  generalize  Theorem  E. 

Theorem  F.  If  is  shift-invariant,  then  <7iE2  >  |  and  ooEi  >  with 

equality  in  both  these  relations  iff 

4>R~\t,uj)  X  (2.S.8) 


Proof:  Applying  Theorem  E  to  the  transform  with  kernel  oRJ? ,  we  have  \  < 
(ZjEr  <  crzEi.  Similarly,  with  the  kernel  j  <  (7(E2  <  (T1E2.  The  righthand 

inequalities are satisfied with equality iff $\theta' = \theta''$. It follows from Theorem E that Eq. 2.8.8 is a necessary and sufficient condition that all these inequalities are satisfied with equality. ///

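Generalizing Theorem D requires that we use the directional variance of $\phi(t,\omega)$, not $|\phi(t,\omega)|^2$, i.e.,

$$\frac{\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} t^2\, \phi R_\theta^{-1}(t,\omega)\, dt\, d\omega}{\int_{-\infty}^{\infty}\int_{-\infty}^{\infty} \phi R_\theta^{-1}(t,\omega)\, dt\, d\omega}. \tag{2.8.9}$$

We define $\sigma_I^2$ and $\sigma_{II}^2$ as the maximum and minimum values of this variance, and $\theta^*$ as the maximum direction.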


Theorem G. Let $F_x(t,\omega)$ be positive and shift-invariant. Then $\sigma_I \sigma_{II} \ge \tfrac{1}{2}$.

Proof: Apply Theorem D to the rotated signal $x_\theta$ and the transform with kernel $\phi R_\theta^{-1}$. ///

Corollary. If $F_x(t,\omega)$ is positive and shift-invariant, then

$$\sigma_t \sigma_\omega \ge \tfrac{1}{2}.$$

From Theorem F, we see that a two-dimensional gaussian transform kernel gives the best time-frequency locality for a given smoothness. In this general case, however, the gaussian kernel may be correlated in time and frequency, i.e., its concentration ellipse may be oriented obliquely in the time-frequency plane. By specifying $\sigma_I$ ($= \sqrt{2}\,\sigma_1$), $\sigma_{II}$ ($= \sqrt{2}\,\sigma_2$), and $\theta'$ for this kernel we are, in effect, selecting a particular time-frequency scale for the transform. By Theorem G, we may choose any values we wish provided $\sigma_I \sigma_{II} \ge \tfrac{1}{2}$, and the resulting transform will best satisfy all our design properties.

When $\sigma_I \sigma_{II} = \tfrac{1}{2}$, this transform is equivalent to a spectrogram with a rotated gaussian window [cf. Riley 1983, Dungeon 1984]. For larger values of $\sigma_I \sigma_{II}$, this transform is equivalent to convolving such a spectrogram with a 2-D gaussian.
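As a rough illustration of how the three kernel parameters might be turned into a covariance matrix, the following sketch rotates a diagonal covariance by the chosen angle and rejects parameter choices that violate the bound of Theorem G. It assumes normalized (dimensionless) time-frequency co-ordinates; the function name and the sample values are hypothetical.

```python
import math

def oriented_kernel_covariance(sigma_major, sigma_minor, theta):
    """Covariance entries (var_t, cov_tw, var_w) of a 2-D gaussian whose major
    axis makes the angle theta with the time axis.

    Raises ValueError when sigma_major * sigma_minor is below 1/2, the bound
    that Theorem G places on positive, shift-invariant transforms."""
    if sigma_major * sigma_minor < 0.5:
        raise ValueError("sigma_major * sigma_minor must be at least 1/2")
    c, s = math.cos(theta), math.sin(theta)
    a, b = sigma_major ** 2, sigma_minor ** 2
    var_t = a * c * c + b * s * s
    var_w = a * s * s + b * c * c
    cov_tw = (a - b) * c * s
    return var_t, cov_tw, var_w

# Minimum-area kernel (product exactly 1/2) tilted by 30 degrees.
var_t, cov_tw, var_w = oriented_kernel_covariance(2.0, 0.25, math.radians(30.0))
```

The determinant of the resulting matrix stays equal to the squared product of the two axis lengths whatever the angle, so the orientation can be chosen independently of the time-frequency area.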

2.9.  A  speech  example 

In this section we examine a particular utterance, comparing the various signal representations discussed above. The utterance is /wioi/ taken from "We owe Eve a dollar", as produced by an adult male. This utterance has some rapid F2 motion, which makes it useful as an example of non-stationary behavior in speech.

Figure 2.12a,b show the traditional wideband and narrowband spectrograms for this utterance. These are spectrograms computed with gaussian windows of standard deviation 1 msec and 15 msec, respectively. The wideband spectrogram shows vertical striations spaced at the pitch period. The narrowband spectrogram shows horizontal striations spaced at the fundamental frequency. They are both due to the voiced excitation. Figure 2.12c shows a spectrogram whose window duration is 4 msec, which is intermediate between the previous two. This window size is matched to the excitation in the following sense. The 2-D gaussian kernel (Eq. 2.6.7) that corresponds to this spectrogram has standard deviations of 2 msec by 20 Hz. These are in the same ratio as 10 msec and 100 Hz, the pitch period and the fundamental frequency, respectively. This choice gives rise to rows and columns of sharp peaks and valleys spaced at the pitch period and the fundamental frequency. We will see in the next chapter why the excitation produces this particular structure.
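The matching rule described above can be sketched numerically. The sketch assumes (this is not stated in the text) that a spectrogram with a gaussian window of amplitude standard deviation $s$ corresponds to a 2-D gaussian kernel with $\sigma_t = s/\sqrt{2}$ and $\sigma_\omega = 1/(\sqrt{2}\,s)$, so that $\sigma_t \sigma_\omega = 1/2$. Under that convention the matched widths come out near 2.8 msec by 28 Hz rather than the rounded figures quoted above, but in the same 10 msec : 100 Hz ratio, and the window is close to 4 msec. The function name is hypothetical.

```python
import math

def matched_gaussian_scales(pitch_period_s, f0_hz):
    """Kernel widths whose time/frequency ratio equals pitch_period / f0,
    assuming the minimum-area case sigma_t * sigma_w = 1/2 (rad/s units)."""
    ratio = pitch_period_s / f0_hz              # seconds per Hz
    # sigma_w = 2*pi*sigma_f, so sigma_t * sigma_f = 1 / (4*pi).
    sigma_t = math.sqrt(ratio / (4.0 * math.pi))
    sigma_f = sigma_t / ratio
    window_std = math.sqrt(2.0) * sigma_t       # assumed window convention
    return sigma_t, sigma_f, window_std

sigma_t, sigma_f, window_std = matched_gaussian_scales(0.010, 100.0)
```

The exact constants depend on how the window and kernel widths are defined, so the numbers should be read as an order-of-magnitude check on the matching argument rather than as the values used for Figure 2.12c.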

Figure 2.13 shows the Wigner distribution for this utterance. Compared to Figure 2.12 it looks almost as if the vertical scale has changed, but it has not. This representation is dominated by cross-terms that give 'echoes' of the formants in initially surprising places. But remember that the sum of two complex exponentials at different frequencies gave rise to a cross-term half-way between them that had greater amplitude than the original terms (Figure 2.8). Evidently, the Wigner distribution itself gives a confusing picture of multi-component signals such as speech.
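The cross-term behavior recalled above can be reproduced with a small discrete sketch. It evaluates one time slice of a discrete Wigner-type sum for two unit-amplitude complex exponentials; the lag count and frequency indices are hypothetical values chosen so that the components fall exactly on the frequency grid and leakage vanishes.

```python
import cmath
import math

L = 65                      # number of lags, m = -32 .. 32
M = (L - 1) // 2
K1, K2 = 10, 30             # component frequencies are K / (2 L) cycles per sample

def signal(n):
    """Sum of two unit-amplitude complex exponentials, defined for any integer n."""
    return (cmath.exp(2j * math.pi * K1 * n / (2 * L))
            + cmath.exp(2j * math.pi * K2 * n / (2 * L)))

def wigner_slice(n0, k):
    """Discrete Wigner-type value at time n0 and frequency k / (2 L)."""
    total = 0j
    for m in range(-M, M + 1):
        local = signal(n0 + m) * signal(n0 - m).conjugate()
        total += local * cmath.exp(-2j * math.pi * k * m / L)
    return total.real

auto_1 = wigner_slice(0, K1)
auto_2 = wigner_slice(0, K2)
cross = wigner_slice(0, (K1 + K2) // 2)
```

At this time instant the value midway between the two components is twice either auto-term. At other instants the midway value oscillates in sign at the difference frequency, which is why smoothing along time suppresses it.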

Figure 2.14 shows the time-frequency autocorrelation function, the 2-D fourier transform of the Wigner distribution, for this utterance in the neighborhood of the origin. Notice the repeated pattern in rows and columns spaced at the pitch period and the fundamental frequency. In Chapter 3 we will see that this pattern can be exploited in understanding how to suppress the excitation.

Figure 2.13. Log magnitude of Wigner distribution. (This is implemented as a pseudo-Wigner distribution using a gaussian window of standard deviation 40 msec [see Claasen & Mecklenbräuker 1980b].)


Figure 2.14. Log magnitude of time-frequency autocorrelation function in the vicinity of the origin.

Figure 2.15 shows the Gaussian transform of this signal using a kernel of a scale chosen to suppress the excitation. The pitch striations are removed, leaving smooth time-frequency ridges that correspond to the formants. The ridges are quite sharp, although it is somewhat difficult to appreciate this in the half-toned picture, Figure 2.15a. The 3-D plot in Figure 2.15b gives a different perspective on this surface. It shows F1 and parts of F2 quite nicely, although most everything above 2 kHz is considerably distorted in this presentation.

Finally,  Figure  2.16  shows  directional  transforms  of  this  utterance  using  oriented 
Gaussian  kernels  matched  to  different  aspects  of  the  signal.  In  Figure  2.16a,  the 
kernel  orientation  is  matched  to  the  rising  F2.  In  Figure  2.16b,  the  kernel  orientation 
is  matched  to  the  falling  F2.  These  choices  bring  out  the  selected  formant  peak  with 
high  resolution. 

In this chapter, we have found that a particular time-frequency energy representation, the Gaussian transform, best satisfies a set of properties deemed desirable. There are several free parameters for this representation ($\sigma_I$, $\sigma_{II}$, and $\theta'$), which determine the scale and directional selectivity of the transform. Deciding what scales are of interest requires a more specific model of the signal. In the next chapter, we adopt such a model.


Figure 2.16. Directional transforms using oriented Gaussian kernels matched to different aspects of the signal. (a) Kernel orientation matched to rising F2. (b) Kernel orientation matched to falling F2.


Chapter  3. 

Time-frequency  filtering 

In this chapter, we continue the discussion of joint time-frequency energy representations for speech signals. Here we shall make stronger assumptions about the form of the signals. We will introduce a particular model of the time-varying vocal tract, and define its 'transfer function', $H(t,\omega)$. We will show that time-frequency filtering can be used to estimate $|H(t,\omega)|^2$, a technique that is essentially a two-dimensional generalization of straightforward, stationary methods. Further, we will see that $|H(t,\omega)|^2$ is closely related to the time-frequency representations of the previous chapter.

3.1.  The  stationary  case 

First, let us re-examine the stationary case. If we adopt a more detailed model of the generation of a stationary speech signal, we can say much more about the cepstral methods discussed in the previous chapter. The linear model [Fant 1960; Flanagan 1972] of vowel production begins by decomposing the speech signal into a vocal source component (e.g., periodic vocal fold vibration) and a vocal tract component, which are treated as independent. The vocal tract is modelled as a linear and quasi-time-invariant filter with excess pressure and volume velocity (of assumed one-dimensional wave motion) being analogous to voltage and current in circuit theory. The distribution of the poles of the filter's system function constitutes the formant description of the vocal tract.


In other words, the transfer function of the stationary vocal tract, $H(i\omega)$, can be approximated by [Flanagan 1972] †

$$H(i\omega) = \sum_{n=1}^{N} \left[ z_n H_n(i\omega) + z_n^* H_{-n}(i\omega) \right], \tag{3.1.1}$$

where $H_n(s)$ consists of a simple pole at $s_n = \alpha_n + i\omega_n$,

$$H_n(i\omega) = \frac{1}{i\omega - (\alpha_n + i\omega_n)}, \tag{3.1.2}$$

and $z_n$ is the residue at the $n$th pole.


^  _ 

n*^n  [(“i  -  «<>)^  +  +  2iWn(a*  -  Q-n)]  ’ 


We associate a formant with each pole, or more precisely, with each pair of poles, since they occur in conjugate pairs, i.e., $s_{-n} = s_n^*$, given the impulse response of the vocal tract is real. The impulse response of the stationary vocal tract, in fact, is

$$h(t) = \sum_{n=1}^{N} \left[ z_n h_n(t) + z_n^* h_{-n}(t) \right], \tag{3.1.4}$$

where

$$h_n(t) = e^{(\alpha_n + i\omega_n)t}\, u(t). \tag{3.1.5}$$

In this linear time-invariant model, it follows that the spectrum of the excitation and the vocal tract transfer function combine by multiplication in the power spectrum and addition in the log spectrum. This fact leads to a simple procedure for separating the excitation and the vocal tract transfer function in certain (idealized) cases.

† This is the parallel formulation. The serial formulation, $H(i\omega) = \prod_n H_n(i\omega) H_{-n}(i\omega)$, is also often used. The former is the partial fraction expansion of the latter.
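The relation between the serial and parallel formulations can be checked numerically. The sketch below assumes simple (non-repeated) poles and unit overall gain, computes each residue directly from the serial product, and compares the two forms at a few frequencies; the pole values are hypothetical.

```python
import math

def serial_response(poles, w):
    """Serial form: product over conjugate pole pairs, evaluated at s = i w."""
    s = 1j * w
    value = 1.0 + 0j
    for p in poles:
        value /= (s - p) * (s - p.conjugate())
    return value

def residue(poles, n):
    """Residue of the serial form at poles[n] (simple poles assumed)."""
    p = poles[n]
    value = 1.0 / (p - p.conjugate())
    for k, q in enumerate(poles):
        if k != n:
            value /= (p - q) * (p - q.conjugate())
    return value

def parallel_response(poles, w):
    """Parallel form: sum over n of z_n H_n + conj(z_n) H_-n."""
    s = 1j * w
    total = 0j
    for n, p in enumerate(poles):
        z = residue(poles, n)
        total += z / (s - p) + z.conjugate() / (s - p.conjugate())
    return total

# Hypothetical formant poles near 500 Hz and 1500 Hz (values in rad/s).
poles = [complex(-300.0, 2.0 * math.pi * 500.0),
         complex(-400.0, 2.0 * math.pi * 1500.0)]
```

Evaluating `serial_response(poles, w)` and `parallel_response(poles, w)` at the same `w` gives the same complex number up to rounding, which is the sense in which one form is the partial fraction expansion of the other.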


Suppose the excitation is an impulse train, which is a very simple model of constant pitch, voiced excitation. In this case, the spectrum of the excitation is also an impulse train, and thus the speech spectrum is a uniformly sampled version of the vocal tract transfer function. If the sampling is unaliased (i.e., the pitch is low enough relative to the highest transfer function quefrencies), the original transfer function can be exactly recovered by ideal low-pass filtering the spectrum, by the sampling theorem [Bracewell 1978]. But this is just cepstral smoothing using, in this very idealized case, a rectangular cepstral window [Oppenheim 1969; Oppenheim & Schafer 1975].

Let us examine this result more closely. The formulation here will be in terms of the power spectrum and its transform, the autocorrelation function, instead of the more usual log spectrum and its transform, the cepstrum, since the former generalizes more easily to the time-varying case. Since the term 'cepstral filtering' is, strictly speaking, reserved for filtering operations on the log magnitude spectrum, we shall refer to analogous operations on the power spectrum as autocorrelation filtering. The results in the stationary case are similar in either formulation. †

If $x(t)$ represents the excitation, $h(t)$ the impulse response of the vocal tract, and $y(t)$ the output speech signal, then in terms of power spectra and transfer function, $|Y(i\omega)|^2 = |X(i\omega)|^2\, |H(i\omega)|^2$, or in terms of autocorrelation functions,

$$A_y(\tau) = \int_{-\infty}^{\infty} A_x(t)\, A_h(\tau - t)\, dt. \tag{3.1.6}$$

† Cepstral and autocorrelation filtering can both be used to separate signal components that arise at different scales in the frequency domain. Cepstral filtering is most appropriate when the signal components combine by convolution in the time domain, autocorrelation filtering when they combine by addition. Both approaches can be used for speech, since we can use either a serial or parallel formulation of the vocal tract model.

Let the excitation be an impulse train, $I(t;T) = \sum_n \delta(t - nT)$. Then

=  Y  E  -  A:r).  (3.1.7) 

ik=— 00 

Thus  from  Eq.  3.1.6,  we  have 

=  (3.1.8) 

Provided the duration of $A_h(\tau)$ is small enough that the terms in Eq. 3.1.8 do not overlap, $A_h(\tau)$ and thus $|H(i\omega)|^2$ can be recovered by windowing $A_y(\tau)$ with a rectangular window centered on the origin and of duration $T$ (see Figure 3.1).


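Let us examine the form of $A_h(\tau)$. Assume for now that the vocal tract transfer function consists of only a single pole, i.e., its impulse response has the form of Eq. 3.1.5. Then

$$\begin{aligned}
A_{h_n}(\tau) &= \int_{-\infty}^{\infty} e^{(\alpha_n + i\omega_n)(\tau + t)}\, u(\tau + t)\, e^{(\alpha_n - i\omega_n)t}\, u(t)\, dt \\
&= e^{(\alpha_n + i\omega_n)\tau} \int_{-\infty}^{\infty} e^{2\alpha_n t}\, u(\tau + t)\, u(t)\, dt \\
&= e^{(\alpha_n + i\omega_n)\tau} \int_{\max(-\tau,0)}^{\infty} e^{2\alpha_n t}\, dt \\
&= \frac{1}{\beta_n}\, e^{\alpha_n |\tau|}\, e^{i\omega_n \tau},
\end{aligned} \tag{3.1.9}$$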


where $\beta_n = -2\alpha_n$ is the (half-power) bandwidth of the pole. Thus, provided this bandwidth is large enough, the overlap in the terms in Eq. 3.1.8 will be negligible, and windowing $A_y(\tau)$ will very nearly recover $A_h(\tau)$ and hence $|H(i\omega)|^2$. †

† The phase of the transfer function can be found, if desired, from its magnitude, since this model is minimum phase [see Oppenheim & Schafer 1975].

Figure 3.1. Recovering the transfer function by autocorrelation filtering. (a) Spectrum of the excitation modelled as an impulse train (10 msec period). (b) Square magnitude of the transfer function, which in this simple example is a single pole of 300 Hz bandwidth. (c) Power spectrum, the product of '(a)' and '(b)'. Cepstral filtering uses the log spectrum instead. The approach here generalizes more easily to the time-varying case. (continued...)

Figure 3.1 (continued). Recovering the transfer function by autocorrelation filtering. (d) Magnitude of the autocorrelation function, the (inverse) fourier transform of '(c)'. Dashed lines show the rectangular window. (e) Fourier transform of the windowed autocorrelation function, which very nearly recovers the transfer function '(b)' in this idealized case (the effect of the slight overlap of the terms in '(d)' is negligible).
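The single-pole autocorrelation of Eq. 3.1.9 can be checked by direct numerical integration. The sketch assumes the autocorrelation is defined as the integral of $h(t+\tau)\,h^*(t)$ over $t$, with $h(t) = e^{(\alpha + i\omega)t}u(t)$, and compares a midpoint-rule estimate with the closed form $e^{\alpha|\tau|}e^{i\omega\tau}/\beta$; the pole values and function names are hypothetical.

```python
import cmath
import math

def pole_autocorrelation_closed_form(alpha, omega, tau):
    """exp(alpha |tau|) exp(i omega tau) / beta, with beta = -2 alpha."""
    beta = -2.0 * alpha
    return cmath.exp(alpha * abs(tau) + 1j * omega * tau) / beta

def pole_autocorrelation_numeric(alpha, omega, tau, steps=20000):
    """Midpoint-rule estimate of the integral of h(t + tau) * conj(h(t)) dt,
    where h(t) = exp((alpha + i omega) t) for t >= 0 and zero otherwise."""
    start = max(-tau, 0.0)              # both unit steps are on from here
    length = 20.0 / abs(alpha)          # integrand has decayed by exp(-40)
    dt = length / steps
    s = complex(alpha, omega)
    total = 0j
    for j in range(steps):
        t = start + (j + 0.5) * dt
        total += cmath.exp(s * (t + tau)) * cmath.exp(s * t).conjugate()
    return total * dt

ALPHA = -math.pi * 300.0                # 300 Hz half-power bandwidth
OMEGA = 2.0 * math.pi * 1000.0          # hypothetical 1 kHz centre frequency
```

Both positive and negative lags agree with the closed form, which confirms the symmetric $e^{\alpha|\tau|}$ envelope that the windowing argument relies on.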


The analysis of the multiple pole case follows from superposition. Provided the poles are not closely spaced relative to their bandwidths, ‡

$$|H(i\omega)|^2 \approx \sum_{n=1}^{N} |z_n|^2 \left[ |H_n(i\omega)|^2 + |H_{-n}(i\omega)|^2 \right]$$

from Eq. 3.1.1 and Eq. 3.1.2, hence

$$A_h(\tau) \approx \sum_{n=1}^{N} \frac{|z_n|^2}{\beta_n}\, e^{\alpha_n |\tau|} \left( e^{i\omega_n \tau} + e^{-i\omega_n \tau} \right) \tag{3.1.12}$$

from Eq. 3.1.9. From this equation and Eq. 3.1.8, we see that windowing the autocorrelation function of the output speech signal can still be used to recover the transfer function when the bandwidths are large enough that aliasing is negligible.

‡ The analysis in terms of log spectra and cepstra does not require this proviso, since convolutions in the time domain transform (exactly) to sums in the cepstral domain. This is an advantage of the cepstral approach.
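How large the bandwidth must be for aliasing to be negligible can be gauged from the $e^{\alpha|\tau|}$ envelope of Eq. 3.1.9. The sketch below, with a hypothetical function name, evaluates the relative size of a pole's autocorrelation at the edge of a window of duration $T$, taking $\alpha = -\pi B$ for a half-power bandwidth of $B$ Hz.

```python
import math

def overlap_at_window_edge(bandwidth_hz, pitch_period_s):
    """|A_h(T/2)| / |A_h(0)| for a single pole, using the exp(alpha |tau|)
    envelope with alpha = -pi * bandwidth_hz."""
    alpha = -math.pi * bandwidth_hz
    return math.exp(alpha * pitch_period_s / 2.0)

wide = overlap_at_window_edge(300.0, 0.010)     # pole of the Figure 3.1 example
narrow = overlap_at_window_edge(150.0, 0.010)   # pole of the Figure 3.2 example
```

With a 10 msec pitch period the 300 Hz pole has fallen to under 1 percent of its peak at the window edge, while the 150 Hz pole is still near 10 percent, which is consistent with treating the first case as essentially unaliased and the second as aliased.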


A few changes to this model make it more realistic. First, the spectrum of constant voiced excitation is somewhat better modelled as an impulse train that drops off at 12 dB per octave [Flanagan 1972]. This trend can be removed by spectral pre-emphasis.


Second, the sampling is usually significantly aliased, which is a more serious problem. In this case, we can recover only a low-pass version of the transfer function. A rectangular window is a poor choice in this case, since its transform rings for a considerable duration in the frequency domain. The gaussian is a good choice, because it has minimal bandwidth for a given window duration, as indicated in the previous chapter (see Figure 3.2). Typically, the standard deviation of the gaussian window is selected about equal to the pitch period.

3.2.  Non-stationary  vocal  tract 

Let us now consider the case where the vocal tract configuration is not necessarily static. The goal is to recover the "time-varying transfer function" of the vocal tract from the signal and remove the excitation, as we did in the stationary case.

Unfortunately, there is no widely accepted, satisfactory definition of the transfer function for a time-varying linear filter, although there have been many proposals [e.g., see Lui 1971; Loynes 1968; Page 1952; Saleh & Subotic 1985]. We shall avoid this difficulty by constraining the form of the transfer function; we shall allow non-stationarity, but only in certain well-behaved ways.

Figure 3.2. Estimating 'aliased' transfer function. (a) Spectrum of excitation modelled as an impulse train (10 msec period). (b) Square magnitude of the transfer function, a single pole of 150 Hz bandwidth. This has higher 'quefrencies' than the previous example; '(a)' undersamples it in this case. (c) Power spectrum, the product of '(a)' and '(b)'. (continued...)

Figure 3.2 (continued). Estimating 'aliased' transfer function. (d) Magnitude of the autocorrelation function, the (inverse) fourier transform of '(c)'. Dotted lines show the gaussian window. (e) Fourier transform of the windowed autocorrelation function, which recovers a low-pass version of the transfer function '(b)'.

The vocal tract, of course, is not an arbitrary time-varying filter; it is constrained by the physical properties of the articulators. Josha has investigated the physics of the non-stationary vocal tract analytically, and found that under certain reasonable physical assumptions it is possible to generalize the notion of a formant to the time-varying case. Essentially, he replaces the assumption of a static vocal tract configuration by the assumption that the deformations are slow enough to satisfy the condition of adiabatic approximation, which he indicates appears to be generally valid from cine X-ray measurements.

We can thus define the impulse response, $h(t,a)$, for a time-varying "resonance" of the vocal tract to an impulse, $\delta(t - a)$, at time $a$ as

$$h(t,a) = e^{(-\beta_0/2 + i\omega_0)(t - a)}\, e^{i\int_a^t \gamma(r)\,dr}\, u(t - a), \tag{3.2.1}$$

where we assume the formant bandwidth $\beta_0$ is fixed, and the formant center frequency is $\omega_0$ at $t = 0$. Note that Eq. 3.2.1 reduces to the usual definition of the impulse response of a formant if the time-varying modulation frequency, $\gamma(t)$, is zero.

In Josha's model, the bandwidth varies somewhat with rate of change of vocal tract area, which we shall treat as negligible. Regarding these bandwidth variations, Fant [1980] believes they "...are of academic rather practical significance. Of greater importance is probably the mere fact that a rapid transition of a formant creates a special perceptual 'chirp' effect."

It will be convenient to examine a more general class of impulse responses than in Eq. 3.2.1. Consider the impulse response

$$h(t,a) = h_0(t - a)\, e^{i\int_a^t \gamma(r)\,dr}, \tag{3.2.2}$$

where $h_0(t)$ is the impulse response of a linear time-invariant (LTI) system and $\gamma(0) = 0$. Eq. 3.2.1 has this form with $h_0(t) = e^{(-\beta_0/2 + i\omega_0)t}\,u(t)$. We call this a frequency-modulated (FM) filter. We shall study this kind of filter in the next several sections, since it is possible to generalize the notion of a transfer function for it and it is possible to estimate this transfer function by generalizing the "cepstral" methods described above. Of course, an FM filter models only a single pole; we shall take up the multiple pole model of the complete vocal tract transfer function in a later section.

How then can we represent the time-varying transfer function of an FM filter? An intuitively appealing candidate is

$$H(t,\omega) = H_0[i(\omega - \gamma(t))], \tag{3.2.3}$$

where $H_0(i\omega)$ is the transfer function of the corresponding stationary filter with impulse response $h_0(t)$ (Eq. 3.2.2). In terms of how we might want to visualize the transfer function of an FM filter, this seems attractive; it is just the stationary transfer function shifted at each time by the local modulation frequency $\gamma(t)$. For a time-varying formant pole, $|H(t,\omega)|^2$ would have the form of a stationary pole in each frequency cross-section with center frequency $\omega_0 + \gamma(t)$ and fixed bandwidth $\beta_0$.
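A short sketch may make this picture concrete. It evaluates the square magnitude of the candidate transfer function for a single pole whose center frequency sweeps linearly in time; the sweep rate, pole values, and function names are hypothetical, and the pole is taken as $H_0(i\omega) = 1/(i\omega - (\alpha_0 + i\omega_0))$ with $\alpha_0 = -\beta_0/2$.

```python
import math

def fm_pole_power(t, w, w0, beta0, gamma):
    """|H(t, w)|^2 for H(t, w) = H0(i (w - gamma(t))), H0 a single pole at
    alpha0 + i w0 with alpha0 = -beta0 / 2."""
    alpha0 = -0.5 * beta0
    detuning = w - gamma(t) - w0
    return 1.0 / (alpha0 * alpha0 + detuning * detuning)

def linear_sweep(t, rate=2.0 * math.pi * 20000.0):
    """Hypothetical modulation frequency: rises 20 kHz per second, zero at t = 0."""
    return rate * t

W0 = 2.0 * math.pi * 1500.0      # centre frequency at t = 0 (rad/s)
BETA0 = 2.0 * math.pi * 100.0    # fixed half-power bandwidth (rad/s)
peak = fm_pole_power(0.010, W0 + linear_sweep(0.010), W0, BETA0, linear_sweep)
edge = fm_pole_power(0.010, W0 + linear_sweep(0.010) + BETA0 / 2.0,
                     W0, BETA0, linear_sweep)
```

At every time the cross-section in frequency peaks at $\omega_0 + \gamma(t)$ and falls to half power at a distance $\beta_0/2$ on either side, so only the location of the pole moves, not its shape.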

For our purposes, the most important properties that the definition of the time-varying transfer function of a formant should satisfy are practical ones: it should provide phonetically relevant information about the signal, and it should be computable from the signal. The representation in Eq. 3.2.3 satisfies these properties since it is a simple generalization of the stationary case, which is already understood, and it can be estimated from the signal by methods we will describe shortly.


The transfer function of an LTI filter, however, also has some nice theoretical properties that would be desirable when generalized to the time-varying case. In particular, the transfer function $H_0(i\omega)$ of an LTI filter, $y(t) = T_0[x(t)]$: (1) specifies the eigenvalues for the filter's eigenfunctions, i.e.,

$$T_0[e^{i\omega t}] = H_0(i\omega)\, e^{i\omega t}, \tag{3.2.4}$$

and (2) is the ratio of the spectrum of the output over the spectrum of the input, i.e.,

$$H_0(i\omega) = \frac{Y(i\omega)}{X(i\omega)}. \tag{3.2.5}$$

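The first property does generalize to the FM case. Consider the functions

$$\varphi_\omega(t) = e^{i\int_0^t \gamma(r)\,dr}\, e^{i\omega t}. \tag{3.2.6}$$

These are the eigenfunctions for an FM filter $T$, with impulse response defined by Eq. 3.2.2. This follows from

$$\begin{aligned}
T[\varphi_\omega(t)] &= \int_{-\infty}^{\infty} h(t,a)\, \varphi_\omega(a)\, da \\
&= \int_{-\infty}^{\infty} h_0(t - a)\, e^{i\int_a^t \gamma(r)\,dr}\, e^{i\int_0^a \gamma(r)\,dr}\, e^{i\omega a}\, da \\
&= e^{i\int_0^t \gamma(r)\,dr} \int_{-\infty}^{\infty} h_0(t - a)\, e^{i\omega a}\, da \\
&= e^{i\int_0^t \gamma(r)\,dr}\, H_0(i\omega)\, e^{i\omega t} \\
&= H_0(i\omega)\, \varphi_\omega(t).
\end{aligned} \tag{3.2.7}$$

Further, we see from Eq. 3.2.7 that $H_0(i\omega)$ specifies the eigenvalues for the eigenfunctions $\varphi_\omega(t)$. The value of $H_0(i\omega)$, however, depends on the choice of the time origin. More generally,

$$T[\varphi_\omega(t)] = H(0,\omega)\, \varphi_\omega(t). \tag{3.2.8}$$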
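The eigenfunction property can be verified numerically for a particular case. The sketch below assumes a single-pole $h_0(t) = e^{(\alpha_0 + i\omega_0)t}u(t)$ and a linear modulation $\gamma(t) = ct$, applies the FM filter of Eq. 3.2.2 to the chirped exponential by a midpoint-rule integral, and compares the result with $H_0(i\omega)$ times the input. All constants and function names are hypothetical.

```python
import cmath
import math

ALPHA0 = -300.0                      # hypothetical pole real part (1/s)
OMEGA0 = 2.0 * math.pi * 1000.0      # hypothetical centre frequency (rad/s)
RATE = 2.0 * math.pi * 5000.0        # gamma(t) = RATE * t, so gamma(0) = 0

def phase(t):
    """Integral of gamma(r) dr from 0 to t for the linear sweep."""
    return 0.5 * RATE * t * t

def eigenfunction(w, t):
    """Chirped exponential exp(i * (integral of gamma + w t))."""
    return cmath.exp(1j * (phase(t) + w * t))

def h(t, a):
    """FM impulse response h0(t - a) * exp(i * integral of gamma from a to t)."""
    if t < a:
        return 0j
    return cmath.exp(complex(ALPHA0, OMEGA0) * (t - a)
                     + 1j * (phase(t) - phase(a)))

def filter_output(w, t, steps=40000):
    """Midpoint-rule estimate of the integral of h(t, a) * eigenfunction(w, a) da."""
    span = 30.0 / abs(ALPHA0)        # h0 has decayed by exp(-30) beyond this
    da = span / steps
    total = 0j
    for j in range(steps):
        a = t - span + (j + 0.5) * da
        total += h(t, a) * eigenfunction(w, a)
    return total * da

def h0_transfer(w):
    """Stationary single-pole transfer function H0(i w)."""
    return 1.0 / (1j * w - complex(ALPHA0, OMEGA0))
```

For any test time `t` and frequency `w`, `filter_output(w, t)` agrees with `h0_transfer(w) * eigenfunction(w, t)` to within the integration error, so the output is the input scaled by a constant that does not depend on time.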
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By  comparison,  some  authors  have  used 

$$H(t,\omega) = \int_{-\infty}^{\infty} h(t,a)\, e^{-i\omega(t - a)}\, da \tag{3.2.9}$$

as their definition of the time-varying transfer function [e.g., Zadeh 1950]. The filter's response to a complex exponential $e^{i\omega t}$ is $H(t,\omega)\,e^{i\omega t}$. However, $e^{i\omega t}$ is not, in general, an eigenfunction of a time-varying system; consequently $H(t,\omega)$ has limited use.


Saleh & Subotic [1985] have explored generalizing the second property (Eq. 3.2.5) to the time-varying case. They suggest using

$$\frac{F_y(t,\omega)}{F_x(t,\omega)} \tag{3.2.10}$$

as the definition of the time-varying transfer function, where $F_x(t,\omega)$ and $F_y(t,\omega)$ are joint time-frequency representations of the input and output signals, respectively. The difficulty with their approach is that the ratio in Eq. 3.2.10, in general, will have different values for different inputs $x(t)$ for a given filter, unlike the LTI case (Eq. 3.2.5). This second property evidently does not generalize well to the time-varying case.

3.3.  Time-frequency  filtering 


The remainder of this chapter is used to show that time-frequency filtering can be used to estimate the transfer function of FM filters and, more generally, of the time-varying vocal tract. Time-frequency filtering consists of multiplying the time-frequency autocorrelation function $A_x(\tau,\nu)$ (Eq. 2.5.5) of the signal $x(t)$ with a 2-D window $\Phi(\tau,\nu)$. The 2-D inverse fourier transform of this windowed function,

$$\mathcal{F}^{-1}[\Phi(\tau,\nu)\, A_x(\tau,\nu)], \tag{3.3.1}$$

becomes the filtered time-frequency representation. The shape of the window, of course, determines what energy is kept and what is removed in the filtered representation [cf. Flandrin 1984].

† I.e., suppose $\tilde{t} = t - \tau$. Let $\tilde{H}(\tilde{t},\omega)$ and $\tilde{H}_0(i\omega)$ be the time-varying transfer function and the corresponding LTI transfer function, respectively, in the new time co-ordinate. Then, $\tilde{H}(\tilde{t},\omega) = H(\tilde{t} + \tau, \omega)$ and $\tilde{H}_0(i\omega) = H_0[i(\omega - \gamma(\tau))] = H(\tau,\omega) = \tilde{H}(0,\omega)$.

This technique is in many ways the time-varying generalization of the "cepstral" methods presented in Section 3.1. The time-frequency autocorrelation takes the place of the autocorrelation function, a 2-D window the place of a 1-D window, and a 2-D inverse fourier transform the place of a 1-D fourier transform in this generalization.
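The three steps of time-frequency filtering can be sketched for a short, periodically extended discrete signal. This is a minimal illustration under assumed discrete conventions (circular indexing, lag index `m`, frequency-shift index `p`), not the implementation used for the figures; all function names are hypothetical, and the quadruple loop is only practical for very small signals.

```python
import cmath
import math

def local_autocorrelation(x):
    """r[n][m] = x[n + m] * conj(x[n - m]), indices taken modulo N."""
    N = len(x)
    return [[x[(n + m) % N] * x[(n - m) % N].conjugate() for m in range(N)]
            for n in range(N)]

def dft_phase(a, b, N):
    return cmath.exp(-2j * math.pi * a * b / N)

def wigner(x):
    """Discrete Wigner-type distribution: transform r over the lag m."""
    N = len(x)
    r = local_autocorrelation(x)
    return [[sum(r[n][m] * dft_phase(k, m, N) for m in range(N))
             for k in range(N)] for n in range(N)]

def ambiguity(x):
    """Discrete time-frequency autocorrelation A[p][m]: transform r over time n."""
    N = len(x)
    r = local_autocorrelation(x)
    return [[sum(r[n][m] * dft_phase(p, n, N) for n in range(N))
             for m in range(N)] for p in range(N)]

def tf_filter(x, window):
    """Multiply A[p][m] by window[p][m], then transform back to (time, frequency)."""
    N = len(x)
    A = ambiguity(x)
    out = []
    for n in range(N):
        row = []
        for k in range(N):
            acc = 0j
            for p in range(N):
                for m in range(N):
                    acc += (window[p][m] * A[p][m]
                            * dft_phase(-p, n, N) * dft_phase(k, m, N))
            row.append(acc / N)
        out.append(row)
    return out

x = [complex(math.cos(0.9 * n), math.sin(0.4 * n * n)) for n in range(7)]
all_pass = [[1.0] * 7 for _ in range(7)]
filtered = tf_filter(x, all_pass)
```

With an all-pass window the filtered representation reduces to the Wigner-type distribution itself, and a window that keeps only the zero frequency-shift row averages that distribution over time. A tapered 2-D window, such as a sampled gaussian, lies between these two extremes.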

The representation in Eq. 3.3.1 also specifies a general member of the quadratic transforms presented in the previous chapter, indicating that the two chapters are related. In this chapter, our goal is to show that a member of this class can give a good estimate of the time-varying "transfer function" of the vocal tract. Happily, it turns out that the form of time-frequency window $\Phi(\tau,\nu)$ that gives a good estimate is a 2-D gaussian, which is the same as Eq. 2.6.7. In other words, we end up with the same kind of time-frequency representation as in the previous chapter, which was based there on weaker, but more general goals.

The results of this chapter, then, reinforce and reinterpret those of the previous chapter. Further, the analysis here suggests which scales to choose, decisions that were free parameters of Chapter 2. In particular, for voiced speech, $\sigma_t$ is matched to the pitch period, and $\sigma_\omega$ is matched to the fundamental frequency.

We have just given the basic result of this chapter. It remains to demonstrate its validity, i.e., that this kind of filtering will give a good estimate of the time-varying vocal tract "transfer function". This requires several steps in which we gradually generalize the form of the filter that models the vocal tract. In Section 3.4, we re-examine the stationary case, this time in terms of the time-frequency autocorrelation function. In Section 3.5, we consider FM filters that have a linearly varying modulation frequency. In Section 3.6, we use a locality argument to generalize these results for quasi-stationary filters and for FM filters that have a smoothly varying modulation frequency, respectively. In Section 3.7, we use a superposition argument to treat the multiple pole case.

3.4.  The  stationary  case  —  re-examined 

So let us assume for now we want to estimate the transfer function of a filter that is time-invariant. We will show how the time-frequency autocorrelation function can be used to produce this estimate.

This will really just be a recapitulation of the stationary argument presented in Section 3.1. In fact, $A_x(\tau,0) = A_x(\tau)$, so we see the correspondence is very close. But with the time-frequency autocorrelation function we will be in a position to generalize these results to the time-varying case, so it is worth the effort.

Letting $x(t)$ represent the filter input, $h(t)$ the filter's impulse response, and $y(t)$ the output, we have

$$A_y(\tau,\nu) = \int_{-\infty}^{\infty} A_x(t,\nu)\, A_h(\tau - t,\nu)\, dt. \tag{3.4.1}$$

In other words, the time-frequency autocorrelation function $A_y(\tau,\nu)$ consists of the convolution of $A_x(\tau,\nu)$ and $A_h(\tau,\nu)$ along the $\tau$ dimension. This is analogous to Eq. 3.1.6.

Let the filter input be an impulse train $I(t;T) = \sum_n \delta(t - nT)$. Then

$$A_I(\tau,\nu) = \int_{-\infty}^{\infty} e^{-i\nu t} \sum_n \delta(t - nT + \tau/2) \sum_m \delta(t - mT - \tau/2)\, dt.$$

Substituting $t' = t - \tfrac{1}{2}(m + n)T$ and $\tau' = \tau + (m - n)T$,

$$A_I(\tau,\nu) = \sum_n \sum_m e^{-i\nu(m+n)T/2} \left\{ \int_{-\infty}^{\infty} e^{-i\nu t'}\, \delta(t' + \tau'/2)\, \delta(t' - \tau'/2)\, dt' \right\}.$$

The quantity in braces is the time-frequency autocorrelation function of an impulse $\delta(t')$, which is $A_\delta(\tau',\nu) = \delta(\tau')$ [see Claasen & Mecklenbräuker 1980a]. Thus,

$$A_I(\tau,\nu) = \sum_n \sum_m e^{-i\nu(m+n)T/2}\, \delta(\tau + (m - n)T).$$

Letting $k = n - m$,

$$A_I(\tau,\nu) = \sum_k e^{i\nu kT/2}\, \delta(\tau - kT) \left\{ \sum_n e^{-i\nu nT} \right\}.$$

The quantity in braces is the fourier transform of an impulse train $I(t;T)$, which is itself an impulse train $\frac{2\pi}{T} I(\nu; 2\pi/T)$ [see Bracewell 1978]. Therefore,

$$\begin{aligned}
A_I(\tau,\nu) &= \frac{2\pi}{T} \sum_k \sum_n e^{i\nu kT/2}\, \delta(\tau - kT)\, \delta(\nu - 2\pi n/T) \\
&= \frac{2\pi}{T} \sum_k \sum_n (-1)^{kn}\, \delta(\tau - kT)\, \delta(\nu - 2\pi n/T).
\end{aligned} \tag{3.4.2}$$

Eq. 3.4.2 shows that the time-frequency autocorrelation function of an impulse train is a rectangular grid of impulses spaced $T$ apart along $\tau$ and $2\pi/T$ apart along $\nu$ (see Figure 3.3). † Eq. 3.4.2 is the two-dimensional analog of Eq. 3.1.7.

† Siebert [1956] has derived the time-frequency autocorrelation function for a train of pulses of arbitrary shape, a result that is important in the theory of radar. The above result follows formally from this if the pulses are given unit area and approach zero width in the limit.

Figure 3.3. Magnitude of the time-frequency autocorrelation function of an impulse train (10 msec period).

Figure 3.4. Magnitude of the time-frequency autocorrelation function of the output
of an LTI filter excited by an impulse train. In this simple example the filter
consists of a single pole of 300 Hz bandwidth.


From Eq. 3.4.1, we have

$$A_y(\tau,\nu) = \frac{2\pi}{T} \sum_k \sum_n (-1)^{nk}\, A_h(\tau - kT, \nu)\, \delta\!\left(\nu - \frac{2\pi n}{T}\right), \qquad (3.4.3)$$

the two-dimensional analog of Eq. 3.1.8. $A_y(\tau,\nu)$ consists of a rectangular grid of
shifted $\tau$ slices of $A_h(\tau,\nu)$ (see Figure 3.4).

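Provided the terms in Eq. 3.4.3 do not overlap, $A_h(\tau,0)\,\delta(\nu)$ can be recovered from
$A_y(\tau,\nu)$ by windowing it with a rectangular window that is centered on the
origin and that has length $T$, width $2\pi/T$, and height $T/2\pi$ (see Figure 3.5). From
$A_h(\tau,0)\,\delta(\nu)$ we can, in turn, recover $|H(j\omega)|^2$, since

$$\mathcal{F}^{-1}\left[A_h(\tau,0)\,\delta(\nu)\right] = \frac{1}{2\pi} \int_{-\infty}^{\infty}\!\int_{-\infty}^{\infty} A_h(\tau,0)\,\delta(\nu)\, e^{j(\nu t - \omega\tau)}\, d\tau\, d\nu
= \int_{-\infty}^{\infty} W_h(t,\omega)\, dt
= |H(j\omega)|^2. \qquad (3.4.4)$$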
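The last step of Eq. 3.4.4, that the Fourier transform along $\tau$ of the zero-frequency-offset slice $A_h(\tau,0)$ gives the squared magnitude of the transfer function, can be illustrated in discrete form. This is a minimal sketch, assuming a sampled single-pole impulse response; the sampling rate and pole parameters are illustrative choices rather than values from the text, and the circular autocorrelation stands in for $A_h(\tau,0)$.

```python
import numpy as np

fs = 8000.0                                   # sampling rate in Hz (assumed)
n = np.arange(1024)
bandwidth_hz, centre_hz = 300.0, 1000.0       # illustrative pole parameters
pole = np.exp((-np.pi * bandwidth_hz + 2j * np.pi * centre_hz) / fs)
h = pole ** n                                 # single-pole impulse response

H = np.fft.fft(h)
# Zero-frequency-offset slice A_h(tau, 0): the (circular) autocorrelation of h.
A_h0 = np.array([np.sum(np.roll(h, -tau) * np.conj(h)) for tau in range(len(h))])
power_spectrum = np.fft.fft(A_h0)             # Fourier transform along tau
```

Because the impulse response has decayed to a negligible level well before the end of the buffer, the circular autocorrelation used here is effectively the ordinary one.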


On the other hand, if the terms in Eq. 3.4.3 do overlap somewhat, then a low-pass
version of $|H(j\omega)|^2$ can still be recovered, since

$$\mathcal{F}^{-1}\left[\Phi(\tau,\nu)\,A_y(\tau,\nu)\right] \approx \mathcal{F}^{-1}\left[\Phi(\tau,\nu)\,A_h(\tau,0)\,\delta(\nu)\right] = \phi(t,\omega) \ast\ast\, |H(j\omega)|^2, \qquad (3.4.5)$$

where $\Phi(\tau,\nu)$ is the time-frequency window, and $\phi(t,\omega)$ is its two-dimensional
inverse Fourier transform. In this case, using a rectangular window on the
time-frequency autocorrelation function is a poor choice since its transform rings for a
considerable duration away from the origin. A Gaussian window minimizes this
problem.
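The contrast between the two window shapes can be seen in one dimension. The following sketch, with window lengths chosen purely for illustration, compares how much of each window's transform remains far from the origin; it is not the windowing used for the figures.

```python
import numpy as np

N = 1024
k = np.arange(N) - N // 2                     # sample index centred on the origin

rect = (np.abs(k) <= 32).astype(float)        # rectangular window, 65 samples wide
gauss = np.exp(-0.5 * (k / 16.0) ** 2)        # Gaussian window, sigma = 16 samples

def far_leakage(window, cutoff=200):
    """Largest transform magnitude more than `cutoff` bins from the origin,
    relative to the peak at the origin."""
    spectrum = np.abs(np.fft.fftshift(np.fft.fft(np.fft.ifftshift(window))))
    spectrum /= spectrum.max()
    return spectrum[np.abs(k) > cutoff].max()

rect_leak = far_leakage(rect)                 # slowly decaying ringing
gauss_leak = far_leakage(gauss)               # essentially zero
```

The rectangular window's transform decays only inversely with distance from the origin, whereas the Gaussian's transform is itself Gaussian and is negligible beyond a few standard deviations.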


Figure 3.5. Rectangular window (very nearly) recovers ‘unaliased’ transfer function.
(a) Windowed time-frequency autocorrelation function in Figure 3.4. (b)
Square magnitude of transfer function, the 2-D inverse Fourier transform of ‘(a)’.
In the ‘aliased’ case, i.e., if the terms in Figure 3.4 were to overlap significantly, a
Gaussian window would be more appropriate.



Let us examine the form of $A_h(\tau,\nu)$, assuming for now that the filter consists of only
a single pole, i.e., its impulse response has the form of Eq. 3.1.5. Then


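$$A_h(\tau,\nu) = \int_{-\infty}^{\infty} h(t+\tau/2)\, h^*(t-\tau/2)\, e^{-j\nu t}\, dt
= e^{j\omega_n\tau} \int_{-\infty}^{\infty} e^{2\sigma_n t}\, u(t - |\tau|/2)\, e^{-j\nu t}\, dt
= \frac{e^{(\sigma_n - j\nu/2)|\tau|}\, e^{j\omega_n\tau}}{-2\sigma_n + j\nu}. \qquad (3.4.6)$$

This last equation is the two-dimensional analog of Eq. 3.1.9.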


Thus, provided the pole bandwidth is large enough, windowing $A_y(\tau,\nu)$ can recover
most of $A_h(\tau,\nu)$, and, hence, a low-pass version of $|H(j\omega)|^2$.

3.5.  Linearly  varying  modulation  frequency 


We now consider the case where we want to estimate the transfer function of an FM
filter that has a linearly varying modulation frequency, i.e., the modulation frequency
in Eq. 3.2.2 equals $mt$. This means

$$h(t,a) = h_0(t-a)\, e^{j\frac{1}{2}m(t^2 - a^2)}. \qquad (3.5.1)$$

The previous section was the special case $m = 0$.


Let us find how passing a signal through such a filter modifies its time-frequency
autocorrelation function. As usual, we let $x(t)$ represent the input to the filter and
$y(t)$ the output. Thus,

$$y(t) = \int_{-\infty}^{\infty} x(a)\, h(t,a)\, da
= e^{j\frac{1}{2}mt^2} \int_{-\infty}^{\infty} x(a)\, e^{-j\frac{1}{2}ma^2}\, h_0(t-a)\, da. \qquad (3.5.2)$$


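Letting $\tilde{x}(t) = x(t)\,e^{-j\frac{1}{2}mt^2}$ and $\tilde{y}(t) = y(t)\,e^{-j\frac{1}{2}mt^2}$, we have from Eq. 3.5.2 and
Eq. 3.4.1,

$$A_{\tilde{y}}(\tau,\nu) = \int_{-\infty}^{\infty} A_{\tilde{x}}(\xi,\nu)\, A_{h_0}(\tau-\xi,\nu)\, d\xi. \qquad (3.5.3)$$

In other words, the time-frequency autocorrelation function of $\tilde{y}(t)$ consists of the
convolution of the time-frequency autocorrelation functions of $\tilde{x}(t)$ and $h_0(t)$ along the $\tau$
dimension.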


We are more directly interested in $A_x$ and $A_y$ than $A_{\tilde{x}}$ and $A_{\tilde{y}}$. But this last
transformation is simple, since the time-frequency autocorrelation function has the
following nice property: if $z(t) = x(t)\,e^{j\frac{1}{2}mt^2}$, then [Van Trees 1971]

$$A_z(\tau,\nu) = A_x(\tau,\nu - m\tau). \qquad (3.5.4)$$

In other words, multiplying a signal by a linear chirp shears its time-frequency
autocorrelation function along the $\nu$ dimension (see Figure 3.6).
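The shearing property can be verified numerically in a discrete, circular setting. The sketch below is an illustration under stated assumptions: `tf_autocorr` is a hypothetical helper in which row `tau` corresponds to a total lag of `2*tau` samples, so the continuous shear by $m\tau$ appears as a circular shift of `2*m*tau` bins; the signal length is taken even and the chirp rate integer so that the chirp is periodic over the buffer.

```python
import numpy as np

def tf_autocorr(x):
    """Discrete, circular time-frequency autocorrelation (illustrative helper).

    A[tau, nu] = sum_t x[t + tau] * conj(x[t - tau]) * exp(-2j*pi*nu*t/N).
    """
    n_samples = len(x)
    t = np.arange(n_samples)
    rows = [np.fft.fft(x[(t + tau) % n_samples] * np.conj(x[(t - tau) % n_samples]))
            for tau in range(n_samples)]
    return np.array(rows)

rng = np.random.default_rng(0)
N, m = 32, 3                                  # even length, integer chirp rate
t = np.arange(N)
x = rng.standard_normal(N) + 1j * rng.standard_normal(N)
z = x * np.exp(1j * np.pi * m * t**2 / N)     # x multiplied by a linear chirp

A_x, A_z = tf_autocorr(x), tf_autocorr(z)
# Row tau of A_z should equal row tau of A_x shifted by 2*m*tau frequency bins.
sheared = np.array([np.roll(A_x[tau], 2 * m * tau) for tau in range(N)])
```

Each lag row is displaced in frequency by an amount proportional to the lag, which is exactly a shear of the lag-frequency plane.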


Combining Eq. 3.5.3 and Eq. 3.5.4, we see that

$$A_y(\tau,\nu) = \int_{-\infty}^{\infty} A_x(\xi,\, \nu + m(\xi-\tau))\, A_{h_0}(\tau-\xi,\, \nu - m\tau)\, d\xi. \qquad (3.5.5)$$

In words, the time-frequency autocorrelation function of a signal passed through the
filter in Eq. 3.5.1 can be found by first shearing its input time-frequency
autocorrelation function, convolving that with the time-frequency autocorrelation function
of $h_0(t)$, and then shearing the output time-frequency autocorrelation function in
the opposite direction, all with respect to the $\nu$ dimension (see Figure 3.7).


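When the filter input is an impulse train $i(t;T)$, the time-frequency autocorrelation
function of the filter output is

$$A_y(\tau,\nu) = \frac{2\pi}{T} \sum_k \sum_n (-1)^{nk}\, A_{h_0}(\tau - kT,\, \nu - m\tau)\, \delta\!\left(\nu - m(\tau - kT) - \frac{2\pi n}{T}\right). \qquad (3.5.6)$$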


Figure 3.8. Magnitude of the time-frequency autocorrelation function of the output
of an FM filter with linearly varying modulation (slope 10 Hz/msec) excited by
an impulse train (10 msec period). In this example, the corresponding LTI filter
consists of a single pole of 300 Hz bandwidth.


version of $|H(t,\omega)|^2$ can still be recovered, since

$$\mathcal{F}^{-1}\left[\Phi(\tau,\nu)\,A_y(\tau,\nu)\right] \approx \phi(t,\omega) \ast\ast\, |P(\omega - mt)|^2, \qquad (3.5.8)$$

where $\Phi(\tau,\nu)$ is the time-frequency window, and $\phi(t,\omega)$ is its inverse Fourier
transform. A 2-D Gaussian window is used, and its dimensions are matched to the period
$T$ and the fundamental frequency $2\pi/T$, respectively (see Figure 3.10).


Figure 3.9. Rectangular window (very nearly) recovers ‘unaliased’ transfer function.
(a) Windowed time-frequency autocorrelation function in Figure 3.8 (axes: time
in sec, frequency in Hz). (b) Square magnitude of transfer function, the 2-D inverse
Fourier transform of ‘(a)’. In the ‘aliased’ case, i.e., if the terms in Figure 3.8 were
to overlap, a Gaussian window would be more appropriate.


So far, we have shown that time-frequency filtering can be used to estimate the
transfer function of two kinds of linear filters: time-invariant filters and FM filters with
linearly varying modulation frequency. We now show that more general cases will
follow from the time locality of this operation.

3.6.  The  quasi-stationary  case 

We next consider the quasi-stationary case in which the vocal tract changes slowly
over time. The traditional way to deal with this situation is to extend the stationary
arguments (Section 3.1) by substituting the short-time spectrum for the spectrum
of the entire signal. There are thus two windows involved in this analysis: the
spectrogram window, $w_s(t)$, and the autocorrelation function window, $w_a(\tau)$.

The ‘two-dimensional’ approach that we have outlined above extends directly without
the need of an additional window. In fact, the estimate of $|H(t,\omega)|^2$ is a positive
representation of the signal energy, so from Eq. 2.6.5 we know that $|H(t_0,\omega)|^2$
effectively depends only on signal values within a few $\sigma_t$ of $t_0$. Provided the
quasi-stationary signal does not change much over this interval, the stationary results
of Section 3.4 generalize immediately.

These two approaches for quasi-stationary signals, the former using a 1-D window,
$w_s(t)$, on the signal and a 1-D window, $w_a(\tau)$, on the autocorrelation function, and
the latter using a single 2-D window, $\Phi(\tau,\nu)$, on the time-frequency autocorrelation
function, are related. In fact, $\Phi(\tau,\nu) = A_{w_s}(\tau,\nu)\,w_a(\tau)$. The latter approach
specifies the time and frequency scale of interest independently with each of the
dimensions of the window $\Phi(\tau,\nu)$. This is somewhat cleaner than the former, which
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where  hn{t,a)  is  the  impulse  response  of  each  pole,  Hq  ,12  1  |i  f  f.q 
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If  the  spectral  shaping  of  the  first  transmission  channel  is  gradual,  i.e.,  r(t)  is  of 
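short duration, then from Eq. 3.9.1, $W_p(t,\omega) \approx |R(j\omega)|^2\, W_y(t,\omega)$. If the gain
variations of the second transmission channel are slow, then from Eq. 3.9.2,
$W_q(t,\omega) \approx s^2(t)\, W_y(t,\omega)$. It follows from these equations and Eq. 3.5.8 that

$$\mathcal{F}^{-1}\left[\Phi(\tau,\nu)\,A_p(\tau,\nu)\right] \approx \phi(t,\omega) \ast\ast\, |R(j\omega)|^2\, |H(t,\omega)|^2 \qquad (3.9.3)$$

and

$$\mathcal{F}^{-1}\left[\Phi(\tau,\nu)\,A_q(\tau,\nu)\right] \approx \phi(t,\omega) \ast\ast\, s^2(t)\, |H(t,\omega)|^2. \qquad (3.9.4)$$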

Thus, these simple kinds of transmission channels have simple effects on the transfer
function estimate. The broadband LTI channel essentially shapes the estimate's
frequency slices and the slowly varying gain channel shapes its time slices.

3.10.  The  excitation 

Up to now, we have assumed the filter excitation has been an impulse train. We
consider more general (and realistic) forms of excitation in this section.

We can create a general periodic excitation from an impulse train by passing it
through an LTI filter whose impulse response $r(t)$ has the excitation's pulse shape.
The output can then be passed through the time-varying filter $h(t,a)$. Provided
the spectral shaping by $r(t)$ is gradual, i.e., $r(t)$ is of short duration, then these two
filtering operations will commute. The assumption is that the time-varying filter
can be considered quasi-stationary over the duration of $r(t)$. This is a reasonable
assumption for the gradual spectral rolloffs produced in speech excitation. Since
these two operations commute under these circumstances, the effect of the filter $r(t)$
on the transfer function estimate is given by Eq. 3.9.3.

Similarly,  slowly  varying  changes  in  the  amplitude  z(t)  of  the  excitation  will  result 
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in  corresponding  changes  in  the  amplitude  of  the  filter  output,  with  the  effect  on 
the  transfer  function  estimate  given  by  Eq.  3.9.4.  The  pitch  period  need  not  be 
constant,  either.  Using  the  locality  arguments  again,  we  only  require  that  the  pitch 
period  changes  slowly. 

Finally, consider the case where the filter is noise-excited. Martin & Flandrin
[1985] discuss using time-frequency filtering as a general approach for analyzing
non-stationary random signals. Our model here involves not only non-stationarity,
but also noise that is not additive, and a careful theoretical analysis of this case has
not been attempted yet. We must be content, for now, with the following comment.
We have seen in the previous chapter that these methods can be used to select time
and frequency scales that remove the fine structure introduced by the excitation.
This, of course, remains true for this case.


Chapter  4. 

The  Schematic  Spectrogram 


4.1.  Rationale 


In the previous chapters we have seen how to obtain a well-behaved representation of
the speech energy, with a choice of the time and frequency scales of interest. For
the next step we are faced with a methodological decision. If we are willing to make
strong assumptions about the signal early on, then we can use those constraints
in some detection scheme. For example, one can assume the speech spectrum is
composed of a number of poles, and use analysis-by-synthesis or linear predictive
coding methods to fit these poles to the spectrum in a formant analysis.


In this approach, a synthetic multiple pole spectrum is fit to each short-time spectrum.
Typically, the pole frequencies can be varied, but for tractability the number
of poles and their bandwidths are held fixed. Stevens & House [1955] and
Olive [1971], for example, computed the mean-square difference between log-magnitude
short-time speech spectra and a function of the form:
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(4.1.1) 


The  poles  of  the  synthetic  spectrum  that  is  found  to  have  the  least  RMS  error 
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are  taken  to  be  the  formants.  The  permissible  range  for  each  of  the  poles  is  often 
restricted to the typical ranges for the corresponding formants in this method.
Different versions of this method are identified by the search strategy used to find
the best match. Some have used exhaustive search [Stevens & House 1955; Bell,
et al. 1961; Matthews, et al. 1961], so-called analysis-by-synthesis. Olive [1971] used
hill-climbing techniques. Linear-predictive coding can be viewed as fitting a fixed
number of poles to short-time spectra, using a slightly different spectral distance
measure than RMS distance [Atal 1971; Markel & Gray 1976]. The great advantage
of LPC is that it provides a simple closed-form solution to the search for an optimum
fit.

One  problem  with  this  approach,  as  stated,  is  that  it  depends  on  the  quasi-stationary 
assumption.  The  short-time  spectral  contribution  of  a  formant  in  rapid  motion  is 
poorly  modelled  as  a  pole  with  a  bandwidth  appropriate  for  a  stationary  formant. 
Even  when  the  bandwidths  are  variable,  as  in  the  LPC  technique,  the  diffuse  spec¬ 
tral  contribution  of  the  moving  formant  can  cause  incorrect  formant  matches.  In 
principle, these methods can be generalized to the time-varying case. Liporace
[1975], in fact, has done so for the LPC technique.

This approach, however, suffers from a more general problem. The model used to
generate the synthetic spectra has little notion of the source or transmission channel
characteristic, or of nasalization. These effects can contribute significantly to the
speech spectrum, “competing” for poles that were meant to be fit to the formants,
and thus often resulting in pole distributions that have poor correspondence to the
formant distribution. The degree of the fit to a particular point in the spectrum
depends on the entire pole distribution, i.e., on the number of poles used and where
each pole is positioned in the spectrum. Thus, errors in one part of the spectrum
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are  propagated  to  other  parts  in  the  very  first  stage  in  the  - 

For example, Figure 4.1 shows pole locations found by [...] autocorrelation method.
The order of the analysis was [...] to allow for two complex poles per 1000 Hz plus
4 poles for [...] spectral balance (e.g., 12 pole analysis for 4 kHz [...]). [...] window
was used of 25 msec duration, also a typical choice. [...] that this analysis can
perform poorly in regions of rapid [...]. [...] 4.1b,c, it appears that the addition of
a nasal [...] F1 resulted in spurious, unstable behavior in the [...] the duration of
the window sometimes gives [...] situations, but increases the [...].

The problem, in general, with making [...] analysis is that they are seldom
universal. [...] and the transmission channel (e.g., [...] acoustics [...]) [...] formant
analysis more difficult than just fitting poles [...].

The approach we take here is more conservative and [...] applied to vision by
Marr [...]. He suggested [...] make no decisions that may have to be taken back
later [...] the principle of explicit naming, produce as rich and useful [...] of the
input signal as possible, but without any early commitment [...] origin. This
description can be then further organized and analyzed [...] of finding its physical
correlates.


Applying these guidelines to speech suggests taking the energy representation, as
in Figure 2.15, and producing rich, symbolic descriptions of the significant features
there. There are several features (at various scales) that suggest themselves: time
discontinuities (up and down edges) useful for finding onsets, offsets and bursts;
time-frequency ridges, easily seen in Figure 2.15, useful for finding the formants
and perhaps channel resonances; and some form of gross spectral balance measure,
also useful for formant and channel analysis. We call this composite symbolic
representation the schematic spectrogram.

4.2. Spectral Peaks

To  create  this  representation,  we  must  come  up  with  computations  that  identify 
these  features.  This  is  not  as  easy  as  it  may  seem,  since  the  features  clearly  visible 
in  Figure  2.15  may  nevertheless  require  some  non-trivial  computations  to  detect 
reliably. We focus on how to find the time-frequency ridges, due primarily to the
formants, in the next sections.

An  obvious  way  to  try  to  find  these  ridges  is  to  identify  peaks  in  vertical  slices  of  the 
time-frequency  energy  surfaces.  This  approach  has  been  tried  by  several  authors, 
with  the  main  difference  between  the  various  instances  being  how  the  smoothing 
was accomplished. Flanagan [1956] used a filter bank whose output was low-pass
filtered, Schafer & Rabiner used cepstral smoothing [Oppenheim 1969; Oppenheim
& Schafer 1975], while McCandless [1974] used LPC-based smoothing [Atal 1971;
Markel & Gray 1976].

To examine this technique, we will use the smoothed time-frequency surfaces of
Chapters 2 and 3. Since these surfaces are smooth, the spectral peaks can be
found by looking for maxima, i.e., (negative) zero-crossings in $\frac{\partial}{\partial\omega}F(t,\omega)$. Figure
4.2 shows these points for the time-frequency energy surface in Figure 2.15. While
the horizontal ridge due to F1 is well captured, the steeply rising F2 is very poorly
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Figure  4.2.  Peaks  in  spectral  cross-sections  of  the  time-frequency  energy  surface 
in Figure 2.15. The energy ridge due to F2 is poorly captured by this peak
computation.


captured. This may seem surprising at first, but the reason is simple.
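For reference, the vertical-slice peak computation just described can be sketched as follows. This is a minimal illustration, assuming the energy surface is stored as an array with time along axis 0 and frequency along axis 1; the function name `slice_peaks` is hypothetical, and a finite difference stands in for the frequency derivative.

```python
import numpy as np

def slice_peaks(F):
    """Mark local maxima along the frequency axis (axis 1) of each time slice.

    A peak is a sample where the finite difference along frequency changes
    from positive to non-positive (a negative-going zero-crossing).
    """
    dF = np.diff(F, axis=1)                   # dF[:, f] = F[:, f+1] - F[:, f]
    peaks = np.zeros(F.shape, dtype=bool)
    peaks[:, 1:-1] = (dF[:, :-1] > 0) & (dF[:, 1:] <= 0)
    return peaks

# Two short time slices with one and two spectral peaks respectively.
example = np.array([[0.0, 1.0, 3.0, 2.0, 0.0],
                    [0.0, 2.0, 1.0, 4.0, 1.0]])
example_peaks = slice_peaks(example)
```

Each time slice is treated independently here, which is precisely the one-dimensional viewpoint whose limitation the following paragraphs explain.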

Eq. 3.5.8 models the situation with F2. The formant pole $P(\omega - mt)$ with
time-frequency slope $m$ is smoothed by the 2-D Gaussian $\phi(t,\omega)$ to give $F(t,\omega)$. This
will produce a time-frequency ridge in $F(t,\omega)$ that has a roughly constant width,
independent of slope $m$, when measured perpendicular to the formant trajectory in
the time-frequency plane. However, the width of the ridge in a vertical slice increases
with increasing slope; evidently in Figure 2.15, F2 was sufficiently broadened that its
spectral peak was completely lost to other effects in the signal, i.e., other formants,
noise, the source and transmission channel characteristic (cf. Figure 2.4).
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This  effect  is  not  an  idiosyncrasy  of  our  particular  choice  of  time-frequency  energy 
representation.  It  is  true,  for  example,  of  any  representation  computed  with  signal 
windows  (e.g.,  any  positive  representation,  by  Thm.  A),  since  if  the  formant  moves 
enough  in  frequency  over  the  duration  of  the  window,  its  spectral  representation 
will  be  significantly  broadened. 

One could rethink the design choices for the time-frequency energy representation,
trying for better spectral resolution at the expense of our chosen criteria. However,
the problem is not there, as a re-examination of Figure 2.15 will show. The
F2 ridge is clearly visible in this representation; it looks no more broadened than
the stationary F1. This is because we see both dimensions of time and frequency
simultaneously, and as the formant ridge broadens in frequency with increasing
slope it narrows in time. Its prominence depends on its width perpendicular to its
trajectory, which does not change much with slope.
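The geometry behind this observation is simple enough to state explicitly. The sketch below assumes an idealized straight ridge of fixed perpendicular width in a time-frequency plane whose axes have already been scaled to comparable units; the function name and the dimensionless slope are illustrative assumptions.

```python
import math

def slice_widths(perp_width, slope):
    """Widths of a straight ridge seen in a vertical (frequency) slice and in
    a horizontal (time) slice, given its width perpendicular to its trajectory
    and its slope in the scaled time-frequency plane."""
    stretch = math.hypot(1.0, slope)          # sqrt(1 + slope**2)
    freq_width = perp_width * stretch
    time_width = perp_width * stretch / abs(slope) if slope != 0 else math.inf
    return freq_width, time_width

flat = slice_widths(1.0, 0.0)                 # stationary formant
steep = slice_widths(1.0, 3.0)                # rapidly rising formant
```

As the slope grows, the frequency-slice width grows without bound while the time-slice width shrinks toward the perpendicular width, so a detector confined to vertical slices sees a progressively flatter peak.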

Why  then  did  we  confine  our  peak  detection  methods  to  vertical  slices?  It  was  the 
usual  quasi-stationary  prejudice  of  thinking  of  speech  analysis  in  terms  of  a  family 
of  one-dimensional  spectral  analyses  parameterized  by  time.  Just  like  the  energy 
representation  problem,  this  problem  is  inherently  two-dimensional  and  should  be 
treated  as  such. 

4.3. Time-frequency ridges - non-directional kernel

The approach we will use for detecting time-frequency ridges will depend on whether
we use a directional or a non-directional kernel for the underlying energy
representation. If we use a non-directional kernel, the problem is simpler, so we shall
address this first. In this case, we begin with a single time-frequency representation
at a given time and frequency scale, as in Figure 2.15, and the problem reduces to
finding the ridges in this smooth, two-dimensional surface.

How can we find ridges in a smooth, two-dimensional surface? This becomes a
problem in differential geometry. As such, let us look at the gradient and curvature
vectors of the surface in the neighborhood of a ridge. Figure 4.3 shows them for the
time-frequency surface in Figure 2.15 in the neighborhood of the initial steep F2.
In particular, the solid vectors are used to depict the direction of the gradient, $\nabla F$,
i.e., the local direction of steepest ascent. The dotted vectors depict the direction
of greatest downward curvature, gdc $F$, i.e., the local direction in which the surface
curves the most downward from the tangent plane.

A precise definition of gdc $F$ is in order. We will use the second derivative as the
measure of curvature; this is sometimes called unnormalized curvature. This is
used instead of normalized curvature (which has the form
$\frac{d^2y}{dx^2} \big/ \left[1 + \left(\frac{dy}{dx}\right)^2\right]^{3/2}$ in one
dimension) for two reasons. First, it is simpler. Second, unnormalized curvature
scales linearly with a change in the amplitude scaling, whereas normalized curvature
does not. If we use the former, our ridge computation proves invariant under changes in
the amplitude scaling.


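Given this, we define gdc $F$ as the direction vector of the minimum second
directional derivative at a given point. More formally, let

$$H = \begin{pmatrix} \dfrac{\partial^2 F}{\partial t^2} & \dfrac{\partial^2 F}{\partial t\,\partial f} \\[2mm] \dfrac{\partial^2 F}{\partial f\,\partial t} & \dfrac{\partial^2 F}{\partial f^2} \end{pmatrix} \qquad (4.3.1)$$

denote the Hessian matrix for $F(t,f)$. Let $\xi$ denote the eigenvector of $H$
corresponding to the lesser eigenvalue $\kappa$. Then gdc $F = \xi/|\xi|$.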


Let us now return to Figure 4.3. As one might expect, the gradient points toward the
top of the ridge on each side of it, but must swing through it as one passes over the
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Figure  4.3.  Gradient  and  curvature  vectors  in  the  vicinity  of  the  rising  F2  in  Figure 
2.15.  The  solid  vectors  depict  the  gradient  direction,  and  the  dotted  vectors  depict 
the  direction  of  greatest  downward  curvature.  (The  vector  lengths  are  normalized 
to  unity.) 


top. The direction of greatest downward curvature, however, points perpendicular
to the ridge in its entire neighborhood, since a surface will curve downward more
sharply as one moves toward and away from the top of a ridge than if one moves
along it. Note that the two kinds of vectors will become perpendicular precisely on
the top of the ridge.
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We  define  the  ridge  top  as  the  locus  of  points  that  satisfy 

$$\nabla F \cdot \operatorname{gdc} F = 0 \quad \text{and} \quad \kappa < 0, \qquad (4.3.2)$$

where $\kappa$ is the minimum second directional derivative. The inner product of these
vectors is zero precisely when they are perpendicular, and $\kappa < 0$ ensures that the
point is a ridge top and not a trough bottom.

We now show this definition is equivalent to moving along lines of curvature on
$F(t,f)$ corresponding to the greatest downward curvature and noting passage through
a peak on that surface. This gives an intuitively simple interpretation of a ridge
top, and shows that gdc $F$ essentially provides the local ridge direction.

Let $g: \mathbb{R} \to \mathbb{R}^2$ be a parameterized, differentiable curve with $g'(s) = \operatorname{gdc} F(g(s))$. In
other words, $g$ traces out a curve in the time-frequency plane that is always tangent
to the direction of maximum downward curvature. When $F \circ g$ goes through a
peak, $\frac{d}{ds}F(g(s)) = 0$. By the chain rule, this occurs precisely where
$\nabla F \cdot g'(s) = \nabla F \cdot \operatorname{gdc} F = 0$. If $\kappa < 0$, the curve goes through a maximum.† But this is just
our ridge top definition, Eq. 4.3.2, as desired.

The inner product in Eq. 4.3.2 is easy to compute for each point on these
time-frequency surfaces (one only needs the first and second derivatives of the surface,
which are simple to compute for such a smooth surface). Since this quantity
may vanish in between sample points in a digital implementation, we detect
zero-crossings between adjacent sample points.
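The ridge-top computation described above can be sketched in a few lines of NumPy. This is an illustrative sketch rather than the implementation used for the figures: the function name `ridge_tops`, the unit grid spacing, the finite-difference derivatives, and the `floor` amplitude threshold are all assumptions. Because the sign of an eigenvector is arbitrary, the sketch fixes it by making the frequency component non-negative, which is adequate unless the ridge is nearly vertical.

```python
import numpy as np

def ridge_tops(F, floor=0.0):
    """Mark samples adjacent to a zero-crossing of grad(F) . gdc(F) with kappa < 0.

    F is sampled with axis 0 = time and axis 1 = (scaled) frequency, unit
    spacing on both axes. `floor` is a hypothetical amplitude threshold.
    """
    Ft, Ff = np.gradient(F)
    Ftt, Ftf = np.gradient(Ft)
    Fff = np.gradient(Ff, axis=1)
    hessian = np.stack([np.stack([Ftt, Ftf], axis=-1),
                        np.stack([Ftf, Fff], axis=-1)], axis=-2)
    eigvals, eigvecs = np.linalg.eigh(hessian)     # eigenvalues in ascending order
    kappa = eigvals[..., 0]                        # minimum second directional derivative
    gdc = eigvecs[..., :, 0]                       # unit eigenvector belonging to kappa
    # Fix the arbitrary eigenvector sign so that neighbouring samples agree.
    gdc = gdc * np.where(gdc[..., 1] < 0, -1.0, 1.0)[..., None]
    inner = Ft * gdc[..., 0] + Ff * gdc[..., 1]
    sign = np.sign(inner)
    crossing = np.zeros(F.shape, dtype=bool)
    crossing[:, :-1] |= sign[:, :-1] * sign[:, 1:] < 0   # between frequency neighbours
    crossing[:-1, :] |= sign[:-1, :] * sign[1:, :] < 0   # between time neighbours
    return crossing & (kappa < 0) & (F > floor)

# Synthetic test surface: a straight Gaussian ridge of slope 0.5.
t = np.arange(64)[:, None]
f = np.arange(64)[None, :]
offset = f - (0.5 * t + 10.3)                 # distance in f from the ridge line
surface = np.exp(-0.5 * (offset / 4.0) ** 2)
tops = ridge_tops(surface, floor=0.8)
```

On this synthetic surface the marked samples hug the line $f = 0.5t + 10.3$ to within one sample, regardless of the ridge's slope.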

Figure 4.4 shows the zero crossings in this quantity for the time-frequency energy
surface in Figure 2.15. Note that the steep formant peaks are now as well traced

† This assumes $|g''(s)|$ is negligible; $(F \circ g)''(s) = g'(s) \cdot Hg'(s) + \nabla F \cdot g''(s)$, where $\kappa$ equals the first
term.
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Figure  4.4.  Two-dimensional  ridge  computation  applied  to  the  time-frequency 
energy  surface  in  Figure  2.15.  The  contours  are  those  points  where  the  gradient 
direction  and  direction  of  greatest  downward  curvature  are  perpendicular.  This 
computation  captures  the  steep  time-frequency  ridges,  due  to  rapid  formant  motion, 
as  well  as  the  more  horizontal  ones. 


as the stationary ones by this ridge top computation. The only thresholding
performed here is the removal of points below the signal-to-noise ratio of the analysis.
Thus, fairly low amplitude structure can appear in addition to the significant
time-frequency ridges. We will examine in Section 4.6 how to deal with such clutter.

A few pertinent details have not yet been mentioned. First, to perform this computation,
an aspect ratio has to be chosen between time and frequency, since it is not
invariant  under  different  relative  scalings  of  time  and  frequency.  The  choice  is  nat- 
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ural;  we  use  the  scaling  inherited  from  the  energy  representation:  let  /  = 

Thus, we perform our computations in the new co-ordinates, $(t,f)$.

Second, very high spatial frequencies have been removed from the energy representation
already. Very low spatial frequencies also appear in the vertical direction,
due to amplitude variations and formant motion. We find better results when these
are also removed by filtering; we thus use a smoothed and flattened energy surface
for the ridge computation.

4.4.  Time-frequency  ridges  —  directional  kernel 

A second approach to the problem of identifying time-frequency energy ridges uses
directional kernels. Let $F(t,f;\theta)$ be a family of time-frequency representations of
the class defined by the kernel in Eq. 2.8.8, where $\theta$ gives the preferred direction of
the transform (i.e., the kernel orientation), and the other free parameters, $\sigma_1$ and
$\sigma_2$, are fixed. We would expect in the vicinity of a time-frequency ridge and for fixed
$t$ and $f$, $F(t,f;\theta)$ would be maximum when $\theta$ equalled the local ridge direction $\theta_0$;
in other words, when the transform's orientation is tuned to the local direction of
the energy ridge. We would also expect that $F(t(s),f(s);\theta_0)$ would be maximum
at the ridge top, where $(t(s),f(s))$ is a curve that crosses the ridge perpendicular
to its trajectory. The first case corresponds to a maximum under rotation of the
kernel; the second case corresponds to a maximum under translation of the kernel
along the minor axis of its concentration ellipse (see Figure 4.5).

The locus of points where these two maxima coincide defines a curve in the
time-frequency plane, which we can take as our ridge top definition. That is, we seek the
points that satisfy both

$$\frac{\partial}{\partial\theta}F(t,f;\theta) = 0 \qquad (4.4.1a)$$
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Figure  4.5.  Two  conditions  for  ridge  detection:  (a.)  local  maximum  under  kernel 
rotation,  and  (b)  local  maximum  under  kernel  translation  along  minor  axis. 


and 


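$$\frac{\partial}{\partial s}F(t,f;\theta) = \nabla F \cdot (\sin\theta, -\cos\theta)
= \frac{\partial F}{\partial t}\sin\theta - \frac{\partial F}{\partial f}\cos\theta
= 0. \qquad (4.4.1b)$$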


This computation can be implemented by calculating $\frac{\partial F}{\partial\theta}$ and $\frac{\partial F}{\partial s}$ on a
sufficiently fine grid of samples of $(t,f,\theta)$, and then finding the simultaneous
zero-crossings in the left-hand sides of Eq. 4.4.1a and Eq. 4.4.1b. (The signs of the
zero-crossings have to be examined to ensure that we have maxima and not minima.)

We yet have to specify the scale parameters $\sigma_1$ and $\sigma_2$. Alternatively, we can
specify $\sigma_2$ and $r = \sigma_1/\sigma_2$. We can interpret $\sigma_2$ as the size parameter and $r$ as an
eccentricity parameter, since the greater the value of $r$, the greater the eccentricity
of the concentration ellipse for the kernel (when holding $\sigma_2$ constant).

The  choice  of  r  depends  on  a  tradeoff.  Clearly,  as  r  increases,  time-frequency  locality 
is  sacrificed.  In  particular,  bends  in  the  time-frequency  trajectory  of  an  energy  ridge 
are  poorly  resolved  with  larger  values  of  r. 

On  the  other  hand,  larger  values  of  r  have  an  advantage  in  separating  intersecting 
energy  ridges,  since  the  larger  values  of  r  give  better  selectivity  to  a  particular 
orientation. We can quantify this selectivity as follows.

Consider the response of the transform at a frequency $f_0$ to a complex exponential of frequency $f_0$. The value is independent of $f_0$ and equals the value of $F_x(0, 0; \theta, r)$ when $x(t) = 1$ (i.e., $f_0 = 0$). We can therefore define a tuning curve $\Gamma(\theta, r) = F_x(0, 0; \theta, r)$ that indicates the selectivity of the transform kernel to different values of the orientation and eccentricity parameters.


It is straightforward to show that

$$\Gamma(\theta, r) \propto \frac{1}{\sqrt{1 + (r^2 - 1)\sin^2\theta}}. \tag{4.4.4}$$

In Figure 4.6 this tuning curve is plotted as a function of $\theta$ for several values of $r$.


Figure 4.6. Tuning curves showing directional selectivity of gaussian transform kernels.

Even greater orientation selectivity can be obtained if we modify this ridge top computation. The idea is simple: instead of maximizing the energy, $F(t, f; \theta)$, for various $\theta$ in Eq. 4.4.1a, we can maximize a more directionally selective measure, such as amount of curvature. In particular, we minimize the second directional derivative perpendicular to the kernel orientation. But this is equivalent to maximizing the energy of the transform that uses the modified kernel $\tilde\phi(t, f) = -\frac{\partial^2}{\partial s^2}\phi(t, f)$, where $s$ is the coordinate along the minor axis of the kernel; in other words, we use a modified Gaussian kernel in the computation specified by Eqs. 4.4.1a,b. This new kernel has a central ‘excitatory’ region with ‘inhibitory’ flanks that give greater orientation selectivity. See Figure 4.7.

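The tuning curve for this modified kernel has the form

$$\tilde\Gamma(\theta, r) \propto \cos^2\theta\;\Gamma^3(\theta, r), \tag{4.4.5}$$

where $\Gamma(\theta, r)$ is the tuning curve of Eq. 4.4.4.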


Figure 4.7. Transform kernel $\tilde\phi(t, f) = -\frac{\partial^2}{\partial s^2}\phi(t, f)$, where $\phi(t, f)$ is a 2-D gaussian and $s$ is the coordinate along its minor axis. This new kernel has a central ‘excitatory’ region with ‘inhibitory’ flanks that give greater orientation selectivity.

In Figure 4.8 this tuning curve is plotted as a function of $\theta$ for several values of $r$. These indeed show greater selectivity than the corresponding plots in Figure 4.6.
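The two tuning curves can be checked numerically. The sketch below integrates an oriented gaussian kernel along the line $f = 0$ (the response to $x(t) = 1$) and compares the result with the closed forms. Several things here are assumptions on our part: the function names, the normalization to unity at $\theta = 0$, the reading of the modified kernel as the negative second derivative along the minor axis, and the cubic power in the second closed form, which is what carrying out the integral under that reading gives.

```python
import numpy as np


def gamma_closed(theta, r):
    """Tuning curve of the gaussian kernel, normalized to 1 at theta = 0."""
    return 1.0 / np.sqrt(1.0 + (r**2 - 1.0) * np.sin(theta)**2)


def gamma_mod_closed(theta, r):
    """Tuning curve of the second-derivative kernel (cubic power assumed)."""
    return np.cos(theta)**2 * gamma_closed(theta, r)**3


def tuning_numeric(theta, r, sigma2=1.0, modified=False, n=20001):
    """Integrate the oriented kernel along the line f = 0."""
    sigma1 = r * sigma2
    t = np.linspace(-12.0 * sigma1, 12.0 * sigma1, n)
    u = t * np.cos(theta)    # coordinate along the major axis
    v = -t * np.sin(theta)   # coordinate along the minor axis
    phi = np.exp(-0.5 * (u / sigma1)**2 - 0.5 * (v / sigma2)**2)
    if modified:
        # Negative second derivative along v: excitatory centre, inhibitory flanks.
        phi = phi * (1.0 / sigma2**2 - v**2 / sigma2**4)
    # Plain Riemann sum; the integrand has decayed to zero at both ends.
    return np.sum(phi) * (t[1] - t[0])


if __name__ == "__main__":
    for theta_deg in (0, 15, 30, 60):
        th = np.deg2rad(theta_deg)
        plain = tuning_numeric(th, 3.0) / tuning_numeric(0.0, 3.0)
        mod = (tuning_numeric(th, 3.0, modified=True)
               / tuning_numeric(0.0, 3.0, modified=True))
        print(theta_deg, plain, gamma_closed(th, 3.0), mod, gamma_mod_closed(th, 3.0))
```

Since the modified curve equals the original one multiplied by $\cos^2\theta\,\Gamma^2 \le 1$, it can never be broader, which is consistent with the greater selectivity noted above.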

It  turns  out  that  this  computation  is  a  generalization  of  the  method  in  Section  4.3. 
In  particular,  if  r  =  1,  then  the  two  computations  are  identical;  i.e.,  those  points  at 
which  the  maximum  downward  curvature  is  perpendicular  to  the  gradient  direction 
are  identical  to  those  points  where  the  minimum  second  derivative  is  parallel  to  a 
direction  of  zero  slope. 

We therefore see that this section is a generalization of the previous section. When $r = 1$, optimal localization in time-frequency results. As $r$ is increased, some of this locality is sacrificed for improved orientation selectivity. Thus, a non-directional kernel will give better results when there is only one ridge in the region, while a directional kernel can give better results when two ridges cross.

Figure 4.8. Tuning curves showing directional selectivity of transform kernels of the form in Figure 4.7.

Let us examine these results on our example utterance from Section 2.9. For voiced speech, we choose $\sigma_2$ to match the pitch period, and we let $r > 1$. Then the pitch will be suppressed in each of the $F(t, \omega; \theta)$, using the results of Chapter 3. In Figure 4.9, we show the ridge top analysis on our utterance using the kernel of Figure 4.7 with $r = 2$ and $r = 3$. The case $r = 1$ was shown in Figure 4.4. We see that a less directional kernel (a smaller value of $r$) gives better performance in the neighborhood of isolated formants, while a more directional kernel (a larger value of $r$) gives better performance in regions where two formants ‘cross’ (see Kuhn [1975] for a discussion of the ‘crossing’ of formants in natural speech).

4.5.  Signal  detection  and  ridge  identification 

The preceding sections have been based on heuristic arguments. Can ridge identification be formulated as a problem in optimal signal detection? We examine this question in this section. Let us begin by making some particularly simple assumptions for ease of argument. We assume that the received 2-D signal representation $F(t, \omega)$ consists of a 2-D deterministic function $S(t, \omega; \gamma(t))$, which depends on the unknown continuous function $\gamma(t)$, plus additive white 2-D Gaussian noise. The problem is to estimate $\gamma(t)$, which models the path of an energy concentration in time-frequency. We further simplify the problem by assuming that $S(t, \omega)$, which models the energy ridge, has the form

$$S(t, \omega) = G(t, \omega) ** \sqrt{1 + [\gamma'(t)]^2}\;\delta(\omega - \gamma(t)). \tag{4.5.1}$$

In  other  words,  it  is  a  2-D  smoothed  (i.e.,  broadened)  curve  (the  square  root  factor 
normalizes  the  impulse  for  a  unit  step  in  arc  length). 

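In a straightforward 2-D generalization of the derivation of a matched filter [see Van Trees 1968], the log likelihood of a candidate path $\gamma(t)$ is, apart from terms that do not depend on $\gamma(t)$, proportional to

$$\Lambda[\gamma(t)] = 2\iint F(t, \omega)\, S(t, \omega; \gamma(t))\, dt\, d\omega - \iint [S(t, \omega; \gamma(t))]^2\, dt\, d\omega. \tag{4.5.2}$$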

Substituting Eq. 4.5.1 into Eq. 4.5.2 and changing the order of integration gives

$$\Lambda[\gamma(t)] = 2\int \sqrt{1 + [\gamma'(t)]^2}\;\tilde F(t, \gamma(t))\, dt - \iint [S(t, \omega; \gamma(t))]^2\, dt\, d\omega, \tag{4.5.3}$$

where $\tilde F = F ** G$. The first term is essentially a 2-D matched filter in which the convolution $F ** G$ is matched to the signal shape. The second term takes into account the energy of the deterministic signal. The path $\gamma(t)$ that maximizes Eq. 4.5.3 is the maximum likelihood estimate.

Figure 4.9. Ridge top analysis of /wioi/ using the directional kernel of Figure 4.7. (a) r = 2. (b) r = 3. The more directional kernels give better performance where ridges intersect, but worse performance at sharp bends.

Solving Eq. 4.5.3 for the best path is difficult. In particular, the second term is hard to evaluate (although it is proportional to the arc length of $\gamma(t)$ when it is sufficiently smooth). However, an analysis-by-synthesis procedure could, in principle, be used to compute it numerically. Since we have assumed $\gamma(t)$ is continuous, this becomes a global optimization over $t$ and $\omega$. This is rather like one-pole analysis-by-synthesis with a continuity condition imposed on the pole trajectory.
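To make the analysis-by-synthesis idea concrete, the following sketch evaluates a discretized version of the log likelihood for a candidate path by synthesizing the smoothed ridge explicitly and correlating it with the observed surface. It is only a sketch under assumptions of our own: a unit-spaced grid, an isotropic gaussian for $G$, linear splitting of each impulse between the two nearest frequency bins, and the hypothetical names `synthesize_ridge` and `log_likelihood`. The search over paths, which is the hard part, is not attempted.

```python
import numpy as np
from scipy.ndimage import gaussian_filter


def synthesize_ridge(gamma, n_w, sigma):
    """Smoothed curve S on a unit grid: a weighted impulse per time sample,
    placed at frequency gamma[i], then blurred by a 2-D gaussian G."""
    gamma = np.asarray(gamma, dtype=float)
    n_t = gamma.size
    # Arc-length weight sqrt(1 + gamma'(t)^2).
    weight = np.sqrt(1.0 + np.gradient(gamma)**2)
    impulses = np.zeros((n_t, n_w))
    lo = np.clip(np.floor(gamma).astype(int), 0, n_w - 2)
    frac = np.clip(gamma - lo, 0.0, 1.0)
    rows = np.arange(n_t)
    # Split each impulse linearly between the two nearest frequency bins.
    impulses[rows, lo] += weight * (1.0 - frac)
    impulses[rows, lo + 1] += weight * frac
    return gaussian_filter(impulses, sigma, mode="constant")


def log_likelihood(F, gamma, sigma):
    """Discrete log likelihood: 2 <F, S> - <S, S> for the candidate path."""
    S = synthesize_ridge(gamma, F.shape[1], sigma)
    return 2.0 * np.sum(F * S) - np.sum(S**2)
```

In the noise-free case the score of a candidate equals $\|F\|^2 - \|F - S\|^2$, so the path that generated $F$ scores highest; with noise or a second ridge present this guarantee no longer holds, which is the difficulty taken up next.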

There is a fundamental problem with this approach, similar to the problem with the pole-fitting approach discussed in Section 4.1. Because of the non-locality of the
optimization,  errors  at  one  point  can  propagate  throughout  the  solution  path  at  this 
very  first  stage  of  the  analysis.  If  the  signal  were  well  modelled  by  Eq.  4.5.3  and  the 
noise  well  modelled  by  additive,  white  Gaussian  noise,  then  this  would  nevertheless 
be  the  best  we  could  do.  Realistically,  this  is  not  the  case.  In  particular,  the  “noise” 
could  include  a  second  ridge;  one  that  we  shouldn’t  treat  as  noise,  but  as  something 
to  detect  also.  The  detection  scheme,  as  formulated,  is  too  global.  Instead,  we  need 
to  make  it  more  local  in  the  time-frequency  plane. 

Consider a small element $\Delta s$ of arc length of the curve $\gamma(t)$, which we can rotate and translate in the $t$-$\omega$ plane. If we hold its position constant, then for sufficiently small $\Delta s$, Eq. 4.5.3 will be maximized for that element if it is oriented perpendicular to the direction of greatest downward curvature. If the element’s orientation is held constant, Eq. 4.5.3 will be maximized for that element if one translates it in the direction of the gradient. Together these imply that elements aligned on the ridge tops defined by Eq. 4.3.2 will locally maximize Eq. 4.5.3, in the sense that further maximization requires moving along the ridge. These considerations show that the ridge operator of Section 4.3 provides a kind of local solution to the detection problem formulated here.


4.6.  Continuity  and  grouping 

We  have  seen  that  the  ridge  detection  methods  of  the  previous  sections  produce 
piecewise  continuous  contours.  This  follows  formally  from  the  Implicit  Function 
Theorem; in particular, the zeroes of a continuously differentiable function $f : \mathbb{R}^2 \to \mathbb{R}$ must form continuous contours in $\mathbb{R}^2$. This continuity is a desirable property
of  the  description  since  it  reflects  a  constraint  on  the  underlying  acoustic  events 
that  is  nearly  always  valid  —  loosely,  that  their  spectral  content  varies  (piecewise) 
continuously  as  a  function  of  time.  For  example,  formant  motion  is  so  constrained. 
We  explore  several  ramifications  of  continuity  in  this  section. 

First, continuity helps to solve a practical problem in descriptions of this kind. The ridge description, as it stands, can be cluttered with low amplitude peaks unrelated to significant phonetic events. If we try to discard this unwanted structure by setting a threshold, we would have to keep it fairly low, otherwise we could throw out the baby with the bath water, breaking important contours into fragments. Continuity lets us use thresholding with hysteresis, which is often used in such cases [cf. Canny 1983]. The idea is to set two thresholds. Points below the lower threshold are first discarded. Points that are above the higher threshold are retained, as are any points between the two thresholds, provided they lie on a contour that crosses the higher threshold. The result is that insignificant points are discarded without fragmenting more important contours. The technique can be quite effective; Figure 4.10 shows an example.
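The two-threshold rule can be sketched for a single contour as follows. This is a minimal illustration and not the code used for Figure 4.10: the function name `hysteresis_keep`, the representation of a contour as an ordered array of ridge amplitudes, and the treatment of values exactly equal to a threshold as passing it are all assumptions.

```python
import numpy as np


def hysteresis_keep(amplitudes, low, high):
    """Return a boolean mask of the points retained on one contour.

    Points below `low` are discarded, which may split the contour into runs.
    A run is kept in full only if at least one of its points reaches `high`.
    """
    a = np.asarray(amplitudes, dtype=float)
    keep = np.zeros(a.size, dtype=bool)
    above_low = a >= low
    start = None
    for i in range(a.size + 1):
        inside = i < a.size and above_low[i]
        if inside and start is None:
            start = i                       # a new run begins here
        elif not inside and start is not None:
            if (a[start:i] >= high).any():  # the run crosses the higher threshold
                keep[start:i] = True
            start = None
    return keep
```

For example, with thresholds 0.3 and 0.8 the amplitude sequence 0.1, 0.4, 0.9, 0.5, 0.2, 0.4, 0.45, 0.1 keeps the run 0.4, 0.9, 0.5 intact and drops the weaker run 0.4, 0.45, whereas a single threshold at 0.8 would have kept only the one point at 0.9.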

Figure 4.10. Hysteresis thresholding applied to utterance /wioi/ of Section 2.6. (a) Two-dimensional ridge tops. Amplitude of the ridge top is depicted by the width of the contour. (b) Hysteresis thresholding of (a). This removes isolated, low amplitude points without fragmenting the more significant contours.


One  may  argue  that  any  kind  of  thresholding  is  a  mistake,  since  unrecoverable  errors 
can  be  made.  Instead,  one  should  simply  carry  along  the  relative  amplitudes  and 
strengths  of  the  various  points  in  the  descriptions,  and  subsequent  processing  can 
take these weights into account. This is, in principle, safer, but practically it is much
harder  to  think  about  processing  a  cluttered,  weighted  description  than  one  that 
has  been  first  cleaned  up.  So  that  the  problem  does  not  become  too  unwieldy  at 
this  stage,  it  is  best  for  now  to  proceed  with  a  cleaned  up  description. 

Continuity plays an important role in another problem: labelling. Our goal is to eventually be able to label the points in the description with their acoustic correlates, e.g., formant identification. This problem would be greatly simplified if a whole contour could receive a single label. For example, suppose points along the two contours in Figure 4.11 are competing for labelling as F2. If the points are sampled every 5 msec, then the points in a 50 msec stretch can be labelled in $2^{10}$ different ways. If each of the contours, however, is known to have a single acoustic correlate, then there are only two possible labellings.

This  is  a  simple  point,  but  it  is  almost  universally  overlooked.  The  usual  approach 
has  been  to  label  individual  points  in  a  spectrum,  and  then  either  ignore  continuity 
altogether,  or  use  it  to  narrow  the  range  of  candidate  labellings  after  the  fact.  The 
latter  approach  leads  to  a  combinatorial  explosion  of  possible  labellings.  Algorithms 
such  as  dynamic  programming  can  be  used  to  make  this  approach  more  manageable, 
but  then  the  effect  of  even  a  single  error  can  be  catastrophic.  A  more  direct  approach 
is  to  first  identify  stretches  of  contour  that  will  receive  a  unique  label,  with  each 
deemed  to  have  a  single  acoustic  correlate. 

How can we identify such “atomic” contours? Ideally, our initial analysis would only return such contours. Acoustic events would never be merged into a single contour, but would always be resolved as separate. I do not believe such a “perfect” analysis is possible. It is evidently possible to fool our auditory system on this account. Consider the spectrum of an /i/ in Figure 4.12a. By low-pass filtering, the spectrum can be tilted to appear as in Figure 4.12b. This will be perceived as an /u/; the F1 of the /i/ is taken as both F1 and F2. Conversely, an /u/ can be high-pass filtered to sound like an /i/, with F1+F2 being taken as F1.

Figure 4.11. Two contours competing for labelling as F2. (a) One of $2^{10}$ possible labellings of a 50 msec stretch when a new label can be assigned every 5 msec. (b) One of two labellings when whole contours receive a single label.

Listeners seldom make this kind of mistake with more natural utterances altered by this kind of filtering. This is because they hear them in context, with continuity being an important contextual cue. For example, consider Figure 4.13, which shows the spectrogram of /wi/. The /i/ in Figure 4.12 was taken from this utterance. If the entire /wi/ is low-pass filtered in the manner of Figure 4.12, it is perceived as /wi/, and not as /wu/. Similarly, a high-pass filtered /yu/ will not sound like it ends in /i/.

Figure 4.12. Turning an /i/ into an /u/. (a) Short-time spectrum of an /i/. (b) Low-pass filtered /i/. This will be perceived as an /u/. In other words, F1 is perceived as F1+F2.

There are two points to be learned from these examples. The first is that it is probably not possible to always separate distinct acoustic correlates of nearby energy concentrations locally, i.e., they can be merged if heard in isolation. The second point is that more global constraints, such as continuity, can resolve these mergers.

Figure 4.13. Spectrogram of /wi/. When this utterance is low-pass filtered as in Figure 4.12, it is still perceived as /wi/. Continuity of the formants allows the correct perception.


The ridge description will represent sufficiently close formants with a single ridge, as in Figure 4.14. When the formants merge, one of the contours terminates, and the other continues on. When the formants split, a new contour appears, while the old contour continues on. Evidently, some contours can change their label along their length. For example, the contour in Figure 4.14 that begins as F1+F2 splits into F1 and F2. Obviously, we cannot always label whole contours with a single label.

We can, however, label portions of contours between splits and mergers with a single label. Said differently, if we identify the locations of splits and mergers, we can break the contours into a set of “atomic” contours, in the sense that each contour will receive a single labelling. Since mergers are sparsely distributed in time-frequency, we will still have a small, manageable set of contours.

Figure 4.14. Merged formants. (a) Wideband spectrogram of utterance “why am”. (b) Ridge tops. When F1 and F2 approximate, their ridges merge.

The idea, then, is to augment our representation to include the locations of splits, mergers, and crossings of contours. Identifying these junctions will serve two purposes. First, contour segments away from them can receive single labels along their length. Second, the junction itself can embody continuity constraints, since the junctions must be consistently labelled. For example, if two contours enter a junction and one leaves it, we may label the exiting contour with the union of the labels of the entering contours.

This is somewhat reminiscent of the junction labelling problem in the blocks world. Perhaps an efficient algorithm to propagate these constraints can be found for formant labelling, as Waltz [1975] found for the blocks world. The problem here is greatly complicated by the fact that there can be many kinds of errors, e.g., a formant can be “missing”. Further, other factors such as spectral balance must be taken into account. We will not attempt any labelling here. Instead, we provide a description of the signal that is a reasonable step toward that goal.

Provided the ridge description is not too cluttered, which is the rule once low amplitude contours have been removed, the identification of contour junctions is relatively easy. In fact, a simple method is to use the proximity of contour endpoints to other contours. Two nearby endpoints define a two-point junction. Three nearby endpoints, or a single endpoint near the body of another contour, define a three-point junction, and so on. Figure 4.15a shows junctions identified by such proximity rules. Contours that both enter and leave a junction are broken there, while two-point junctions can be bridged provided that simple “good continuation” rules are satisfied. The result is a set of contours that are likely to have unique labels of their acoustic correlates along their length. Figure 4.15b shows points where contours are broken based on these junctions.
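The proximity test at the heart of these rules can be sketched as below. This is an illustration only, under assumptions of our own: contours are ordered arrays of (time, frequency) points already scaled to comparable units, `radius` is a hypothetical proximity threshold, and the function name `find_junctions` is invented here. The sketch reports pairwise proximities; grouping them into two-point and three-point junctions, and the “good continuation” test, are left out.

```python
import numpy as np


def find_junctions(contours, radius):
    """List endpoint proximities between different contours.

    Each entry is (a, end, b, j, is_endpoint): endpoint `end` of contour `a`
    lies within `radius` of point `j` of contour `b`; `is_endpoint` says
    whether that point is itself an endpoint of `b` (a candidate two-point
    junction) or lies in its body (a candidate three-point junction).
    """
    junctions = []
    for a, ca in enumerate(contours):
        ca = np.asarray(ca, dtype=float)
        for end in sorted({0, len(ca) - 1}):
            for b, cb in enumerate(contours):
                if b == a:
                    continue
                cb = np.asarray(cb, dtype=float)
                # Distance from this endpoint to every point of contour b.
                d = np.hypot(cb[:, 0] - ca[end, 0], cb[:, 1] - ca[end, 1])
                j = int(np.argmin(d))
                if d[j] <= radius:
                    junctions.append((a, end, b, j, j in (0, len(cb) - 1)))
    return junctions
```

A contour that ends close to the middle of another one yields an entry with `is_endpoint` false, and the second contour would then be broken at point `j`.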

We have shown that the above analysis in some circumstances can produce a more reasonable schematization of the speech signal than, for example, LPC analysis. We will give many more examples of this analysis in the next chapter. Does this mean that the ridge analysis is uniformly better than LPC analysis in speech applications? The answer is no. The simplicity and speed of the LPC algorithms make them attractive for many applications. Further, such pole-fitting models do work well in many cases. Since they embody additional constraints compared to the raw ridge analysis, they will usually not make the ‘mistake’ of merging nearby formants together. Further, insignificant peaks usually do not affect the pole placements. This means that in clean, unnasalized, quasi-stationary male speech LPC analysis can be quite good. In such cases, the ridge analysis may nevertheless merge nearby formants together and may include additional ridges, making that analysis appear inferior to the LPC analysis.

This probably means that the ridge analysis will offer no improvement over the widespread LPC methods in simple speech engineering applications. Frankly, the power and importance of the ideas presented here emerge only when one asks the question: What methods will be appropriate for speech analysis in general, natural settings? Under such circumstances, the transmission channel will often be imperfect and varying (e.g., walking down a hallway with open doors), there can be environmental sounds and nasalization present, and there can be significant non-stationarity. In these cases, the very constraints (i.e., all-pole, quasi-stationary model with a fixed number of poles) that make the LPC technique work so well for ‘clean’ speech can cause it to fail, producing bizarre pole positionings. On the other hand, the ridge analysis, a more conservative technique that makes no such assumptions, will still produce a reasonable schematization of the time-frequency surface. A simple demonstration of these ideas is given in Section 5.6 below. The key idea is that strong commitments to the origin of the signal are not made at the level of the schematic spectrogram. It is only after the ridge tops, and undoubtedly other features such as time-frequency edges, temporal discontinuities, and spectral balance information, have been made explicit that articulatory constraints and such will be brought to bear in this more general, least-commitment approach.


Chapter  5. 

A  Catalog  of  Examples 


In  this  chapter  we  will  apply  the  methods  of  the  previous  chapters  to  a  variety  of 
examples.  This  will  help  us  evaluate  the  strong  points  as  well  as  the  shortcomings 
of  the  ideas  presented.  The  ultimate  test  can  come  only  when  these  ideas  are 
applied  in  a  recognition  scheme.  This,  however,  has  not  been  realized  because  of 
the  many  different  components  that  need  to  be  added,  as  indicated  earlier.  At  this 
point,  evaluation  must  be  based  on  any  intuitive  appeal  of  the  ideas,  and  on  the 
performance  on  various  examples.  Given  that  the  goal  is  to  essentially  ‘schematize’ 
the  information  seen  in  (the  sonorant  regions  of)  a  spectrogram,  an  obvious  test 
is  to  see  how  reasonable  the  computed  description  looks  when  compared  to  the 
spectrogram.  Given  that  previous  approaches  perform  poorly  in  specific  contexts 
(see  Figure  4.1),  clear  improvements  will  be  apparent. 

This  situation  is  similar  to  edge  detection  in  image  analysis.  The  typical  way  to 
evaluate  an  edge  finder  is  to  look  at  its  output  compared  to  the  image  and  ask  how 
good  it  looks.  Perhaps  a  better  test  would  be  to  ask  how  useful  an  edge  finder 
output  is,  say,  when  applied  to  some  scheme  for  finding  surface  discontinuities  or 
stereo  depth.  But  such  a  test  requires  confidence  in  the  validity  of  the  subsequent 
processing, since a bad application of a good idea can perform more poorly than a good application of a bad idea.

In Section 5.1, we will look at some general example sentences. In the following sections, we examine several traditional problem categories in speech analysis: in Section 5.2, we look at semivowels and glides; in Section 5.3, nasalized vowels; in Section 5.4, consonant-vowel transitions; and in Section 5.5, female speech. In Section 5.6, we look at some examples of the effects of different transmission channels on the analysis.

5.1. Some general examples

The  first  four  figures  of  this  chapter  show  the  sentences,  “May  we  all  learn  a  yellow 
lion  roar.”,  “Are  we  winning  yet?”,  “We  were  away  a  year  ago.”,  and  “Why  am 
I  eager?”  spoken  by  adult  males.  These  sentences  were  chosen  because  of  their 
high  proportion  of  sonorant  regions  and  their  variety  of  formant  motion.  We  show 
wideband  spectrograms  and  the  ‘ridge’  analysis  of  the  previous  chapter  for  each 
of  these  utterances.  First  notice  the  generally  good  agreement  between  the  time- 
frequency  ridges  seen  in  the  spectrograms  and  those  computed  by  the  ridge  analysis; 
the  latter  description  is  a  reasonable  partial  ‘sketch’  of  the  former.  This  is  true  even 
in  the  steeper  formant  regions,  such  as  the  various  /w/’s  and  /j/’s  in  these  examples 
and  at  the  velar  pinch  in  Figure  5.4  at  .75  seconds. 

It  is  important  to  emphasize  that  these  are  not  formant  tracks,  but  ridge  locations 
in  the  time-frequency  surface.  For  example,  when  two  formants  come  close  enough 
to  merge,  as  in  the  /wi/  in  Figure  5.1  (between  .2  and  .3  seconds  and  about  2100 
Hz)  or  a  portion  of  the  /r/ in  Figure  5.4  (between  .85  and  .9  seconds  and  2000  Hz), 
only a single ridge is found. (The analysis marks with solid dots the locations at which contours should be broken because of possible mergers (cf. Figure 4.15), which can aid in subsequent labelling of the contours.)

There are also ridges present that are not due to the oral formants. For example, the ridge in Figure 5.4 between .15 sec and .55 sec and at about 200 Hz is attributed to nasalization from the /m/. Viewed as a formant tracker this is a failure, but viewed
as  a  ridge  detector,  this  is  a  success.  The  nasal  resonance  is  strongly  present  in  the 
signal  in  this  region  and  is  correctly  identified  by  the  analysis.  It  is  properly  left 
to  subsequent  processing  to  sort  out  which  ridges  are  due  to  formants  and  which 
are  due  to  other  sources.  This  is  quite  different  from  the  LPC  analysis,  where  the 
presence  of  nasalization  often  causes  sporadic  and  bizarre  placement  of  the  pole 
locations  (Figure  4.1).  In  that  case,  subsequent  processing  would  have  difficulty 
sorting  out  the  situation. 

Finally, there are various missing formants. This is particularly true for F3 when F2
is  quite  low  as  in  the  /w/  in  Figure  5.1.  In  these  circumstances,  F3  is  driven  down 
by  the  tail  of  F2,  and  is  not  really  visible  in  the  spectrograms  either.  We  know 
where  F3  is  by  context,  but  its  time-frequency  ridge  has  essentially  been  driven  into 
the  noise. 



5.2.  Semi-vowels  and  glides 




In  this  section  we  show  examples  of  /w/’s,  /j/’s,  /r/’s  and  /l/’s.  The  /w/’s  and 
/j/'s  are  syllable  initial  in  the  context  of  /wi/  and  /ju/  in  Figure  5.5  and  Figure 
5.6,  respectively.  A  range  of  speech  rates  from  slow  to  rapid  is  shown  that  gives  a 
range  of  F2  formant  slopes  from  gradual  to  steep.  Note  the  ridge  analysis  is  fairly 
insensitive  to  this  parameter. 

The /l/’s in Figure 5.7 are syllable initial, with one example for each of the cardinal vowels, /i/, /ae/, /a/, and /u/. The /r/’s in Figure 5.8 are in the context r where V ranges over /i/, /ae/, /a/, and /u/. These too show some rapid formant motion that is well captured.

5.3.  Nasalized  vowels 

Figure 5.9 shows syllable initial nasalized vowels in the context n . The vowels range over /i/, /ae/, /a/, and /u/. The main feature of this analysis is that additional ridges are introduced due to the nasal ‘formants’. As mentioned earlier, this
contrasts  with  the  pole-fitting  methods,  which  produce  erratic  results  in  nasalized 
vowels  (Figure  4.1). 


In this section we show examples of consonant-vowel transitions. Figure 5.10 through Figure 5.12 show syllable initial consonant-vowel transitions. The consonants range over the voiced stops /b/, /d/, and /g/ and the vowels range over /i/, /ae/, /a/, and /u/. The analysis is shown only after the consonantal burst since the ridge analysis is inappropriate and peculiar in the burst region. The bursts were located by hand in these examples. Figure 5.13 shows more rapid formant motion with the examples /bi/ in the context /tubj'/ and /dw/ in the context /tidw/.

The ridge analysis brings out formant motions consistent with the locus theory of consonant perception. This theory states that one of the cues to the perception of consonants is the trajectories of the formants at the transitions [Liberman et al. 1954]. For example, in many vowel contexts for adult males, F2 will have a trajectory out of the consonant that has a locus near about 1200 Hz for labials (e.g., /b/), about 1800 Hz for alveolars (e.g., /d/), and above 2000 Hz for velars (e.g., /g/). This cue is used in spectrogram reading, but has been hard to exploit in automatic speech analysis, because of unreliable formant detection at the often highly non-stationary consonant-vowel transitions.

The  analysis  here  is  better  behaved,  capturing  rapid  formant  ridges  as  well  as 
shallow  ones  at  the  transitions.  As  noted  earlier,  however,  when  the  formants 
approximate  a  single  ridge  is  produced.  The  F3  ridge  is  also  sometimes  lost  near 
the  transition  for  this  speaker;  in  these  cases,  F3  appears  somewhat  diffuse  and 
hard  to  locate  in  the  spectrograms  also.  These  issues,  as  well  as  how  to  locate  the 
burst,  will  present  difficulties  for  automatic  consonant  detection. 

Higher pitched speech, such as female and children's speech, presents the problem
that the harmonics of the (voiced) excitation are fairly widely spaced, viz. a few
hundred Hertz or more. This means that in a quasi-stationary analysis, the spectrum
is less frequently sampled than for lower pitched speech, resulting in poorer
estimates of the vocal tract transfer function (cf. Figure 3.2). Viewed
two-dimensionally, the situation is more symmetric. For example, as the frequency
of an impulse train is increased, the frequency spacing of the impulses in its
time-frequency autocorrelation function (Figure 3.3) will increase, but their time
spacing will decrease. Thus one will have poorer frequency 'sampling' of a
time-varying transfer function excited by this impulse train, but better time
'sampling'.
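
The tradeoff can be made concrete with a small numerical sketch. It relies only on the fact that an ideal impulse train at pitch f0 has pulses every 1/f0 seconds and harmonics every f0 Hz. The 100 Hz comparison pitch and the 50 ms by 1000 Hz region are illustrative assumptions, not values taken from the analysis.

```python
def impulse_train_spacing(pitch_hz):
    """Return (time spacing in ms, frequency spacing in Hz) of an ideal impulse train."""
    period_ms = 1000.0 / pitch_hz      # pulses recur once per pitch period
    harmonic_spacing_hz = pitch_hz     # harmonics fall at multiples of the pitch
    return period_ms, harmonic_spacing_hz


def samples_in_region(pitch_hz, duration_ms, band_hz):
    """Count pitch pulses in a time span and harmonics in a frequency band."""
    period_ms, spacing_hz = impulse_train_spacing(pitch_hz)
    return duration_ms / period_ms, band_hz / spacing_hz


# Illustrative region: 50 ms by 1000 Hz (an assumption made for this example).
low = samples_in_region(100.0, 50.0, 1000.0)   # lower pitched speech
high = samples_in_region(200.0, 50.0, 1000.0)  # higher pitched speech

print("100 Hz pitch: %.0f time samples, %.0f frequency samples" % low)
print("200 Hz pitch: %.0f time samples, %.0f frequency samples" % high)
```

Doubling the pitch halves the number of frequency 'samples' in the region and doubles the number of time 'samples', so the total number of samples in the region is unchanged.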

The analysis presented in Chapter 3 exploits this fact by matching the time-frequency
window to the pitch. Higher pitched speech requires a window at a larger frequency
scale but at a smaller time scale than lower pitched speech. The remaining analysis
proceeds as before. Figure 5.14 gives an example with rapid F2 motion. Figure
5.14a shows a wideband spectrogram of the nonsense utterance /uiuiui/ from an
adult female; Figure 5.14b shows the ridge analysis using a time-frequency window
matched to a 200 Hz pitch.
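
As a rough sketch of what matching the window to the pitch could look like, the fragment below scales a separable gaussian window so that its frequency scale grows in proportion to the pitch and its time scale shrinks in proportion to the pitch period. The reference pitch and the reference scales are hypothetical placeholders chosen for illustration; the constants actually used are those of Chapter 3 and are not reproduced here.

```python
import math

# Hypothetical reference values, chosen only for illustration.
REF_PITCH_HZ = 100.0
REF_SIGMA_T_MS = 4.0
REF_SIGMA_F_HZ = 80.0


def matched_window(pitch_hz):
    """Return (time scale in ms, frequency scale in Hz) for a given pitch.

    Assumes simple proportional scaling: the time scale follows the pitch
    period and the frequency scale follows the harmonic spacing.
    """
    ratio = pitch_hz / REF_PITCH_HZ
    return REF_SIGMA_T_MS / ratio, REF_SIGMA_F_HZ * ratio


def gaussian_window(t_ms, f_hz, sigma_t_ms, sigma_f_hz):
    """Unnormalized, non-oriented 2-D gaussian evaluated at offset (t_ms, f_hz)."""
    return math.exp(-0.5 * ((t_ms / sigma_t_ms) ** 2 + (f_hz / sigma_f_hz) ** 2))


sigma_t, sigma_f = matched_window(200.0)
print("200 Hz pitch: time scale %.1f ms, frequency scale %.1f Hz" % (sigma_t, sigma_f))
```

Under this assumed scaling the product of the two scales does not depend on the pitch, so only the shape of the window changes: it becomes narrower in time and broader in frequency as the pitch rises.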

Note that the F1 ridge and the steep F2 ridge are well resolved. Where F2 and F3
approach each other, however, only a single ridge is found. Such mergers in the
analysis are more common in higher pitched speech due to the greater frequency
smoothing required. However, since less time smoothing is required than for lower
pitched speech, transient effects should, in principle, be better resolved.

Finally, we consider the effects of imperfect transmission channels on the analysis.
In particular, we will consider the effects of passing the speech signal through some
simple LTI filters. While the examples we give are idealized, natural environments
can give rise to many kinds of transmission channel characteristics. In general,
human listeners can tolerate a wide variety of alterations to a speech signal and
have it remain intelligible [see Licklider & Miller 1951 for a good review]. That is
not to say one is unaware of the modification; e.g., a pronounced room resonance
adds a 'hollow' quality to the speech, but it does not destroy its intelligibility.

Figure 5.15 shows the frequency response of the transmission channels we consider.
Figure 5.15a consists of a single pole at 1500 Hz of 750 Hz bandwidth, Figure
5.15b consists of a single pole at 1500 Hz of 150 Hz bandwidth, and Figure 5.15c
consists of a pole-zero pair, both at 1500 Hz, where the pole has 1000 Hz bandwidth
and the zero has 150 Hz bandwidth. Thus, the first channel is fairly broadband
but non-uniform; the second channel emphasizes the signal energy in the
neighborhood of 1500 Hz; and the third channel removes signal energy in the
neighborhood of 1500 Hz.
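
For readers who wish to reproduce channels of this kind, the following sketch builds digital approximations and locates their spectral peaks and troughs. Several points are assumptions: each 'pole' or 'zero' is taken to mean a complex-conjugate pair (a resonance or anti-resonance), the usual mapping r = exp(-pi B / fs) is used to set the root radius from the bandwidth B, and the 10 kHz sampling rate is a placeholder since the rate is not stated here. The function names are hypothetical.

```python
import cmath
import math

FS = 10000.0  # assumed sampling rate in Hz (not given in the text)


def conjugate_pair(center_hz, bandwidth_hz, fs=FS):
    """Coefficients of 1 - 2 r cos(theta) z^-1 + r^2 z^-2 for a conjugate root pair."""
    r = math.exp(-math.pi * bandwidth_hz / fs)   # root radius from bandwidth
    theta = 2.0 * math.pi * center_hz / fs       # root angle from center frequency
    return [1.0, -2.0 * r * math.cos(theta), r * r]


def magnitude(num, den, freq_hz, fs=FS):
    """Magnitude of num(z) / den(z) on the unit circle at freq_hz."""
    z_inv = cmath.exp(-2j * math.pi * freq_hz / fs)
    n = sum(c * z_inv ** k for k, c in enumerate(num))
    d = sum(c * z_inv ** k for k, c in enumerate(den))
    return abs(n / d)


# (numerator, denominator) for the three channels described above.
CHANNELS = {
    "a": ([1.0], conjugate_pair(1500.0, 750.0)),                          # broadband pole
    "b": ([1.0], conjugate_pair(1500.0, 150.0)),                          # narrowband pole
    "c": (conjugate_pair(1500.0, 150.0), conjugate_pair(1500.0, 1000.0)), # notch
}

FREQS_HZ = list(range(0, 5001, 10))
summary = {}
for name, (num, den) in CHANNELS.items():
    mags = [magnitude(num, den, f) for f in FREQS_HZ]
    summary[name] = {
        "peak_hz": FREQS_HZ[mags.index(max(mags))],
        "trough_hz": FREQS_HZ[mags.index(min(mags))],
        "range_db": 20.0 * math.log10(max(mags) / min(mags)),
    }
    print(name, summary[name])
```

In this sketch the two all-pole channels peak near 1500 Hz, with the narrowband one showing the larger peak-to-trough range, while the pole-zero channel has its minimum near 1500 Hz because the narrow zero dominates the broad pole there.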

We show the effects of these transmission channels on the analysis of the utterance
/wioi/ from Section 2.9. Figure 5.16a shows the wideband spectrogram of this
utterance passed through the first channel, and Figure 5.16b shows the corresponding
ridge analysis. The effect of this broadband channel is minor when compared to
the original analysis in Figure 4.10. Figure 5.17a shows the wideband spectrogram
of the utterance passed through the second channel, and Figure 5.17b shows the
corresponding ridge analysis. The effect of this narrowband channel is to add an
additional ridge at 1500 Hz. Finally, Figure 5.18a shows the wideband spectrogram
of the utterance passed through the third channel, and Figure 5.18b shows the
corresponding ridge analysis. The effect of this narrowband 'notch' is to put an
energy trough in the time-frequency surface, with the F2 ridge being partially
cancelled in the vicinity of this notch. Compare this analysis with the LPC analysis
of this filtered utterance shown in Figure 5.18c (using the same analysis parameters
as in Figure 4.1). We see there that the notch filter plays havoc with the LPC
analysis, since the zero lies outside the scope of its all-pole model. This is
analogous to the effects of nasalization on LPC analysis.

Figure 5.15. Transmission channels. (a) 750 Hz bandwidth pole at 1500 Hz. (b)
150 Hz bandwidth pole at 1500 Hz. (c) Pole-zero pair at 1500 Hz of 1000 Hz and
150 Hz bandwidth, respectively.

Figure 5.16. /wioi/ passed through transmission channel in Figure 5.15a (broadband
filter). (a) Wideband spectrogram. (b) Ridge analysis.

Figure 5.17. /wioi/ passed through transmission channel in Figure 5.15b (narrowband
filter). (a) Wideband spectrogram. (b) Ridge analysis.

Figure 5.18. /wioi/ passed through transmission channel in Figure 5.15c (notch
filter). (a) Wideband spectrogram. (b) Ridge analysis. (c) LPC analysis.